Abstract — Let $A$ be an $n \times m$ matrix with $m > n$, and suppose that the underdetermined linear system $As = x$ admits a sparse solution $s_0$ for which $\|s_0\|_0 < \frac{1}{2}\,\mathrm{spark}(A)$. Such a sparse solution is unique due to a well-known uniqueness theorem. Suppose now that we have somehow obtained a solution $\hat{s}$ as an estimate of $s_0$, and suppose that $\hat{s}$ is only 'approximately sparse', that is, many of its components are very small and nearly zero, but not mathematically equal to zero. Is such a solution necessarily close to the true sparsest solution? More generally, is it possible to construct an upper bound on the estimation error $\|\hat{s} - s_0\|_2$ without knowing $s_0$? The answer is positive, and in this paper we construct such a bound based on the minimal singular values of submatrices of $A$. We also state a tight bound, which is more complicated but, besides being tight, enables us to study the case of random dictionaries and obtain probabilistic upper bounds. We also study the noisy case, that is, where $x = As + n$. Moreover, we will see that as $\|s_0\|_0$ grows, obtaining a predetermined guarantee on the maximum of $\|\hat{s} - s_0\|_2$ requires $\hat{s}$ to be sparse with a better approximation. This can be seen as an explanation of the fact that the estimation quality of sparse recovery algorithms degrades as $\|s_0\|_0$ grows.

Index Terms — Atomic Decomposition, Compressed Sensing 
(CS), Sparse Component Analysis (SCA), Sparse decomposition, 
Overcomplete Signal Representation. 



I. Introduction and problem statement 

Sparse solution of underdetermined systems of linear equations has recently attracted the attention of many researchers from different viewpoints, because of its potential applications in many different problems. It is used, for example, in Compressed Sensing (CS) [1], [2], [3], underdetermined Sparse Component Analysis (SCA) and source separation [4], [5], [6], [7], atomic decomposition on overcomplete dictionaries [8], [9], decoding real field codes [10], image deconvolution [11], [12], image denoising [13], electromagnetic imaging and Direction of Arrival (DOA) finding [14], etc. The importance of sparse solutions of underdetermined linear systems comes from the fact that although such systems generally have an infinite number of solutions, their sparse solutions may be unique.

Let $A = [a_1, \dots, a_m]$ be an $n \times m$ matrix with $m > n$, where the $a_i$'s, $i = 1, \dots, m$, denote its columns, and consider the Underdetermined System of Linear Equations (USLE)

$$As = x. \tag{1}$$

By the sparsest solution of the above system one means a solution $s$ which has the smallest possible number of nonzero components. From the signal (or atomic) decomposition viewpoint, $x$ is a signal which is to be decomposed as a linear combination of the signals $a_i$, $i = 1, \dots, m$; hence, the $a_i$'s are usually called 'atoms' [15], and $A$ is called the 'dictionary' over which the signal is to be decomposed. When the dictionary is overcomplete ($m > n$), the representation is not unique, but by the sparsest solution we mean the representation which uses the smallest possible number of atoms to represent the signal.

It has been shown [14], [16], [17] that if (1) has a sparse enough solution, it is its unique sparsest solution. More precisely:

Theorem 1 (Uniqueness Theorem [16], [17]). Let $\mathrm{spark}(A)$ denote the minimum number of columns of $A$ that are linearly dependent, and let $\|\cdot\|_0$ denote the $\ell^0$ norm of a vector (i.e. the number of its nonzero components). Then if the USLE $As = x$ has a solution $s_0$ for which $\|s_0\|_0 < \frac{1}{2}\,\mathrm{spark}(A)$, it is its unique sparsest solution.

A special case of this uniqueness theorem has also been stated in [14]: if $A$ satisfies the Unique Representation Property (URP), that is, if all $n \times n$ submatrices of $A$ are non-singular, then $\mathrm{spark}(A) = n + 1$ and hence $\|s_0\|_0 \le \frac{n}{2}$ implies that $s_0$ is the unique sparsest solution.
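To make the definition of $\mathrm{spark}(A)$ concrete, the following is a minimal brute-force sketch for very small matrices only. The function name `spark`, the tolerance `tol`, and the two toy matrices are illustrative assumptions and do not come from the paper; the search is combinatorial and not meant for practical use.

```python
import itertools
import numpy as np

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (brute force)."""
    n, m = A.shape
    for k in range(1, m + 1):
        for cols in itertools.combinations(range(m), k):
            # k columns are dependent iff their rank is below k
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < k:
                return k
    return m + 1  # all columns independent (only possible when m <= n)

# Toy 2 x 3 dictionary in which every pair of columns is independent (URP).
A_urp = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
# Toy dictionary whose third column is a multiple of the first.
A_dup = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]])

spark_urp = spark(A_urp)
spark_dup = spark(A_dup)
```

For the first toy matrix the URP holds with $n = 2$, so the sketch is expected to agree with $\mathrm{spark}(A) = n + 1$; for the second, two columns are already dependent.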

Although the sparsest solution of (1) may be unique, finding this solution requires a combinatorial search and is generally NP-hard. Hence, many different sparse recovery algorithms have been proposed to find an estimate of $s_0$, for example, Basis Pursuit (BP) [8], Matching Pursuit (MP) [15], FOCUSS [14], Smoothed L0 (SL0) [18], [19], SPGL1 [20], IDE [21], ISD [22], etc.

Now, consider the following two different cases: 

• Exact sparsity: We say that a vector $s$ is sparse in the exact sense if many of its components are exactly equal to zero. More precisely, $s$ is said to be $k$-sparse in the exact sense if it has at most $k$ nonzero entries (and all other entries are exactly equal to zero).

• Approximate sparsity: We say that a vector $s$ is sparse in the approximate sense if many of its components are very small and approximately equal to zero (but not necessarily 'exactly' equal to zero). More precisely, $s$ is said to be $k$-sparse with approximation $\epsilon$ if it has at most $k$ entries with magnitudes larger than $\epsilon$ (all of its other entries have magnitudes smaller than $\epsilon$).
Some of the sparse recovery algorithms (e.g. BP based on Simplex linear programming) return estimates which are sparse in the exact sense, while some others (e.g. MP with a large enough number of iterations, SL0, FOCUSS and SPGL1) return solutions which are sparse only in the approximate sense.

Suppose now that by using any algorithm (or simply by a magic guess) we have found a solution $\hat{s}$ of $As = x$, as an estimate of the true sparsest solution ($s_0$). The question now is: "Noting that $s_0$ is unknown, is it possible to construct an upper bound for the estimation error $\|\hat{s} - s_0\|_2$ only from $\hat{s}$, where $\|\cdot\|_2$ stands for the $\ell^2$ norm?" For example, if $A$ satisfies the URP, and $\|\hat{s}\|_0$ is less than or equal to $\lfloor n/2 \rfloor$, where $\lfloor x \rfloor$ stands for the largest integer smaller than or equal to $x$, then the uniqueness theorem ensures that $\hat{s} = s_0$. On the other hand, if all the components of $\hat{s}$ are nonzero but its $(\lfloor n/2 \rfloor + 1)$'th largest magnitude component is very small, heuristically we expect $\hat{s}$ to be close to the true solution $s_0$, but the uniqueness theorem says nothing about this heuristic.

In this paper, we will see that the answer to the above question is positive, and we will construct upper bounds on $\|\hat{s} - s_0\|_2$ without knowing $s_0$, which depend on the matrix $A$ and (in the case $A$ satisfies the URP) are proportional to the magnitude of the $(\lfloor n/2 \rfloor + 1)$'th largest component of $\hat{s}$. Consequently, if the $(\lfloor n/2 \rfloor + 1)$'th largest component of $\hat{s}$ is zero, then our upper bounds vanish, and hence $\hat{s} = s_0$. This is, in fact, the same result provided by the uniqueness theorem, and hence our upper bounds can be seen as a generalization of the uniqueness theorem. In other words, from the classical uniqueness theorem, all that we know is that if among the $m$ components of $\hat{s}$, $m - \lfloor n/2 \rfloor$ components are 'exactly' zero, then $\hat{s} = s_0$; but if $\hat{s}$ has more than $\lfloor n/2 \rfloor$ nonzero components (even if $m - \lfloor n/2 \rfloor$ of its components have very small magnitudes), we are not sure to be close to the true solution. As we will see in this paper, our upper bounds, however, ensure that in the second case, too, we are not far from the true solution. Moreover, the dependence of our upper bounds on $A$ provides some explanation of the sensitivity of the error to the properties of the matrix $A$.

Upper bounds on the error $\|\hat{s} - s_0\|_2$ can also be found in some other works, e.g. [23], [24], [25]. In some of these works (e.g. [23], [24]) the bounds are probabilistic, that is, they have been obtained for random dictionaries and shown to hold with probabilities larger than certain values. Being non-deterministic, these bounds cannot be used to infer deterministic results. For example, they cannot be used to say whether or not the heuristic stated above (that is, "if $\hat{s}$ has at most $n/2$ 'large' components, then it is close to the true solution") is generally true, while our bounds answer this question. Another difference between our bounds and those of [23], [24] is that in [23], [24] it has been assumed that we have at hand an algorithm for estimating the sparsest solution of an underdetermined linear system, and several calls to this algorithm are required, whereas in this paper we have at hand only a single estimate ($\hat{s}$) of the sparsest solution ($s_0$), and we develop upper bounds on the error $\|\hat{s} - s_0\|_2$ without knowing $s_0$. Moreover, the bounds in some of these works (e.g. [24], [25]) have been constructed for specific methods used for finding the estimate $\hat{s}$, e.g. minimizing the $\ell^1$ or $\ell^q$ norms for $0 < q < 1$, whereas in this paper we discuss bounds based on $\hat{s}$ itself and independent of the method used for its estimation: it may be obtained by any algorithm or by a magic guess. In fact, to the best of our knowledge, constructing a deterministic bound on $\|\hat{s} - s_0\|_2$ that is independent of the method used for obtaining $\hat{s}$ has not previously been addressed in the literature. Note however that although our deterministic bounds can be used to infer deterministic results, they are not suitable for practical calculation, because they need the Asymmetric Restricted Isometry Constant (ARIC) [25], [26] of a dictionary, or similar quantities, whose calculation is computationally intractable for large matrices (note however that these quantities have to be calculated only once for each dictionary). We will also present a probabilistic bound for random dictionaries, which is again independent of the method used to obtain the estimate $\hat{s}$.

A related problem has already been addressed in [27], in which, for the noisy case $x = As + e$, deterministic upper bounds have been constructed for the error $\|\hat{s} - s_0\|_q$ (for a set of different $q$'s including $q = 2$). However, in that paper it has been implicitly assumed that $\hat{s}$ is sparse in the exact sense, that is, $\|\hat{s}\|_0 \le \lfloor n/2 \rfloor$; otherwise, their upper bounds grow to infinity. On the other hand, if the noise power ($\|e\|_2$) is set equal to zero, the upper bounds of [27] for $\|\hat{s} - s_0\|_q$ vanish, resulting again in the uniqueness theorem. In other words, reference [27] can be seen somehow as a generalization of the uniqueness theorem to the noisy case, whereas our paper can be seen as a generalization of the uniqueness theorem to the case where $\hat{s}$ is not sparse in the exact sense. We will also consider in Section V the case where there is noise and $\hat{s}$ is sparse in the approximate sense. Some error bounds for the noisy case have also been obtained in [9], but those bounds are for specific algorithms for estimating $s_0$, while our bounds are based only on $\hat{s}$ itself and are independent of the method used for finding it.

Some parts of this work have been presented in the conference paper [28]. Here, we study the problem more thoroughly (without repeating some details of that conference paper), and we also provide a tight bound on the above error. Imposing no assumption on the normalization of the columns of the dictionary, this tight bound will enable us to obtain a probabilistic upper bound. Moreover, we address the noisy case where $\hat{s}$ is sparse in the approximate sense.

The paper is organized as follows. In Section II we review a first result already stated in [19], which provides the basic idea of this paper. Then in Section III we present a bound based on minimal singular values of the submatrices of the dictionary. Our tight bound is then presented in Section IV. By considering the noisy case in Section V, we complete our discussion on deterministic dictionaries before studying random dictionaries in Section VI.

II. A FIRST BOUND 

A first result has been given in Corollary 1 of Lemma 1 of [19] during the analysis of the convergence of the SL0 algorithm. We review that result here (with a few changes in notation).

For the $n \times m$ matrix $A$, let $\mathcal{P}_j(A)$, $1 \le j \le m$, denote the set of all matrices which are obtained by taking $j$ columns of $A$. Moreover, let $\mathcal{M}_n(A) = \mathcal{P}_1(A) \cup \mathcal{P}_2(A) \cup \dots \cup \mathcal{P}_n(A)$, and define

$$G_A \triangleq \max_{B \in \mathcal{M}_n(A)} \|B^\dagger\|_F, \tag{2}$$

where $B^\dagger$ stands for the Moore-Penrose pseudoinverse of $B$, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The constant $G_A$ depends only on the dictionary $A$. Moreover, for a vector $y$ and a positive scalar $\alpha$, let $\|y\|_{0,\alpha}$ denote the number of components of $y$ which have magnitudes larger than $\alpha$. In other words, $\|y\|_{0,\alpha}$ denotes the $\ell^0$ norm of a thresholded version of $y$ in which the components with magnitudes smaller than or equal to $\alpha$ are clipped to zero.

Corollary 1 of Lemma 1 of [19] then states:

Corollary 1 (of [19]). Let $A$ be an $n \times m$ matrix with unit $\ell^2$ norm columns which satisfies the URP and let $\delta \in \mathrm{null}(A)$. If for an $\alpha > 0$, $\delta$ has at most $n$ components with absolute values greater than $\alpha$ (that is, if $\|\delta\|_{0,\alpha} \le n$), then

$$\|\delta\|_2 \le (G_A + 1)\, m\, \alpha. \tag{3}$$

We now define the following notation (see also Fig. 1):

Definition 1. Let $s$ be a vector of length $m$. Then $h(k, s)$ denotes the magnitude of the $k$'th largest magnitude component of $s$.

[Figure: sorted $|s_i|$'s (in descending order)]
Fig. 1. The definition of $h(k, s)$: Sort the magnitudes of the entries of $s$ in descending order. Then, $h(k, s)$ is the magnitude of the $k$'th element (denoted by $\alpha$ in the figure).

Then, using the above corollary, Remark 5 of Theorem 1 of [19] states the following idea to construct an upper bound on $\|\hat{s} - s_0\|_2$: Let $\alpha_{\hat{s},n} \triangleq h(\lfloor n/2 \rfloor + 1, \hat{s})$. Since the true sparsest solution ($s_0$) has at most $\lfloor n/2 \rfloor$ nonzero components, $\hat{s} - s_0$ has at most $n$ components with absolute values greater than $\alpha_{\hat{s},n}$, that is, $\|\hat{s} - s_0\|_{0,\alpha_{\hat{s},n}} \le n$. Moreover, $(\hat{s} - s_0) \in \mathrm{null}(A)$ and hence Corollary 1 implies that

$$\|\hat{s} - s_0\|_2 \le (G_A + 1)\, m\, \alpha_{\hat{s},n}. \tag{4}$$

This result is consistent with the heuristic stated in the introduction: "if $\hat{s}$ has at most $n/2$ 'large' components, the uniqueness of the sparsest solution ensures that $\hat{s}$ is close to the true solution".

III. A BOUND BASED ON MINIMAL SINGULAR VALUES

The bound (4) is not easy to analyze and work with. In particular, the dependence of the bound on the dictionary (through the constant $G_A$) is very complicated. Moreover, calculating the constant $G_A$ for a dictionary requires calculation of the pseudoinverses of all of the $\binom{m}{1} + \binom{m}{2} + \dots + \binom{m}{n}$ elements of $\mathcal{M}_n(A)$. In this section, we modify (4) to obtain a bound that is easier to analyze and whose dependence on (the statistics of) $A$ is simpler from a statistical point of view. Moreover, we state our results for more general cases than where $A$ satisfies the URP.
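The combinatorial cost of $G_A$ mentioned above can be seen in a small sketch. The code below enumerates every submatrix in $\mathcal{M}_n(A)$ to evaluate (2) and then the right-hand side of (4). The function names (`h`, `count_above`, `G_constant`, `first_bound`) and the toy $2 \times 3$ dictionary are illustrative assumptions, not part of the original paper, and the approach is only feasible for tiny matrices.

```python
import itertools
import numpy as np

def h(k, s):
    """Magnitude of the k-th largest-magnitude entry of s (k is 1-based)."""
    return np.sort(np.abs(s))[::-1][k - 1]

def count_above(y, alpha):
    """||y||_{0,alpha}: number of entries with magnitude larger than alpha."""
    return int(np.sum(np.abs(y) > alpha))

def G_constant(A):
    """Max Frobenius norm of pinv(B) over all submatrices with 1..n columns."""
    n, m = A.shape
    best = 0.0
    for j in range(1, n + 1):
        for cols in itertools.combinations(range(m), j):
            B = A[:, list(cols)]
            best = max(best, np.linalg.norm(np.linalg.pinv(B), 'fro'))
    return best

def first_bound(A, s_hat):
    """Right-hand side of (4): (G_A + 1) * m * h(floor(n/2) + 1, s_hat)."""
    n, m = A.shape
    return (G_constant(A) + 1.0) * m * h(n // 2 + 1, s_hat)

# Hypothetical toy dictionary with unit-norm columns satisfying the URP.
r = 1.0 / np.sqrt(2.0)
A = np.array([[1.0, 0.0, r],
              [0.0, 1.0, r]])
s0 = np.array([1.0, 0.0, 0.0])                  # exactly 1-sparse solution
s_hat = s0 + 0.01 * np.array([r, r, -1.0])      # s0 plus a null-space vector

G = G_constant(A)
err = np.linalg.norm(s_hat - s0)
bound = first_bound(A, s_hat)
```

In this toy case `s_hat` solves the same system as `s0` but is only approximately sparse, and the actual error `err` should not exceed `bound`, in line with (4).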



A. Definitions and notations 

For a matrix $B$ let $\sigma_{\min}(B)$ or $\sigma_{\min,B}$ denote its smallest singular value¹. Similarly, we denote its largest singular value by $\sigma_{\max}(B)$ or $\sigma_{\max,B}$. We now define the following notation about the dictionary $A$:

• Let $q = q(A) = \mathrm{spark}(A) - 1$. Then, by definition, any $q$ columns of $A$ are linearly independent, and there is at least one set of $q + 1$ columns which are linearly dependent (in the literature, the quantity $q$ is usually called the 'Kruskal rank' or 'k-rank' of $A$). It is also obvious that $q \le n$, in which $q = n$ corresponds to the case where $A$ satisfies the URP.

• Let $\sigma_{\min}^{(j)}(A)$ or $\sigma_{\min,A}^{(j)}$ denote the smallest singular value among the singular values of all submatrices of $A$ obtained by taking $j$ columns of $A$, that is,

$$\sigma_{\min}^{(j)}(A) \triangleq \min_{B \in \mathcal{P}_j(A)} \{\sigma_{\min}(B)\}. \tag{5}$$

Note that since any $q$ columns of $A$ are linearly independent, we have $\sigma_{\min}^{(j)}(A) > 0$ for all $1 \le j \le q(A)$.
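As an illustration of the two quantities just defined, the following sketch evaluates $\sigma_{\min}^{(j)}(A)$ in (5) by exhaustive search and derives $q(A)$ from it. The names `sigma_min_j` and `kruskal_rank`, the tolerance, and the toy matrices are assumptions made for this example. `numpy.linalg.svd` returns $\min(p, q)$ singular values, including zeros, for a $p \times q$ matrix.

```python
import itertools
import numpy as np

def sigma_min_j(A, j):
    """Smallest singular value over all submatrices made of j columns of A."""
    m = A.shape[1]
    return min(np.linalg.svd(A[:, list(c)], compute_uv=False).min()
               for c in itertools.combinations(range(m), j))

def kruskal_rank(A, tol=1e-10):
    """Largest q such that every set of q columns is linearly independent."""
    n, m = A.shape
    q = 0
    for j in range(1, min(n, m) + 1):
        if sigma_min_j(A, j) <= tol:
            break
        q = j
    return q

r = 1.0 / np.sqrt(2.0)
A_urp = np.array([[1.0, 0.0, r],
                  [0.0, 1.0, r]])      # unit-norm columns, URP holds
A_dup = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]])    # columns 1 and 3 are parallel

sig1 = sigma_min_j(A_urp, 1)
sig2 = sigma_min_j(A_urp, 2)
q_urp = kruskal_rank(A_urp)
q_dup = kruskal_rank(A_dup)
```

For the first toy matrix the worst pair of columns makes an angle of 45 degrees, so the smallest singular value over pairs is $\sqrt{1 - 1/\sqrt{2}}$.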

Recall now the following lemma [30, p. 419] (we presented a direct simple proof for the first two parts of this lemma in [28]).

Lemma 1. Let $B$ be an $n \times p$ matrix, and let $B'$ denote the matrix obtained by adding a new column to $B$. Then:

a) If $p < n$ ($B$ is tall), then $\sigma_{\min}(B') \le \sigma_{\min}(B)$.

b) If $p \ge n$ ($B$ square or wide), then $\sigma_{\min}(B') \ge \sigma_{\min}(B)$.

c) We have always $\sigma_{\max}(B') \ge \sigma_{\max}(B)$.
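The three parts of Lemma 1 can be checked numerically on random matrices. The sketch below is only an illustration under assumed sizes and an assumed random seed; it is not a proof and is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def extreme_singular_values(B):
    """Return (sigma_min, sigma_max) using the min(p, q) singular values of B."""
    s = np.linalg.svd(B, compute_uv=False)
    return s.min(), s.max()

def append_column(B, rng):
    """Return B with one extra random column appended on the right."""
    return np.hstack([B, rng.standard_normal((B.shape[0], 1))])

tall = rng.standard_normal((5, 3))    # n = 5, p = 3 < n
wide = rng.standard_normal((3, 5))    # n = 3, p = 5 >= n

tall_min, tall_max = extreme_singular_values(tall)
tall2_min, tall2_max = extreme_singular_values(append_column(tall, rng))
wide_min, wide_max = extreme_singular_values(wide)
wide2_min, wide2_max = extreme_singular_values(append_column(wide, rng))
```

Part a) corresponds to `tall2_min <= tall_min`, part b) to `wide2_min >= wide_min`, and part c) to both `*_max` values not decreasing.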

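Using the above lemma, the sequence $\sigma_{\min,A}^{(j)}$, $j = 1, \dots, m$, is decreasing for $1 \le j \le q$ and increasing for $n \le j \le m$. More precisely, if $q = n$ (URP case), we have

$$\sigma_{\min,A}^{(1)} \ge \sigma_{\min,A}^{(2)} \ge \dots \ge \sigma_{\min,A}^{(n)} \le \sigma_{\min,A}^{(n+1)} \le \dots \le \sigma_{\min,A}^{(m)}, \tag{6}$$

and if $q < n$, we have

$$\sigma_{\min,A}^{(1)} \ge \dots \ge \sigma_{\min,A}^{(q)} > \sigma_{\min,A}^{(q+1)} = \dots = \sigma_{\min,A}^{(n)} = 0 \le \sigma_{\min,A}^{(n+1)} \le \dots \le \sigma_{\min,A}^{(m)}. \tag{7}$$

Note also that (6) and (7) imply that for $1 \le j \le q$

$$\sigma_{\min}^{(j)}(A) = \min_{\|x\|_0 \le j} \frac{\|Ax\|_2}{\|x\|_2}. \tag{8}$$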



¹ In some references, e.g. [29], the singular values of a matrix are defined to be strictly positive quantities. This definition is not appropriate for this paper. We use the more common definition of Horn and Johnson [30, pp. 414-415], in which the singular values of a $p \times q$ matrix $M$ are the square roots of the $\min(p, q)$ largest eigenvalues of $M^H M$ (or $M M^H$). Using this definition, there are always $\min(p, q)$ singular values, where a zero singular value characterizes a (tall or wide) non-full-rank matrix.
Remark. The quantity defined in (5) is closely related to the Restricted Isometry Property (RIP) [10], [31], and is in fact the left Asymmetric Restricted Isometry Constant (ARIC) of $A$ [25], [26]. As introduced in [10], the Restricted Isometry Constant (RIC) of $A$ is defined as the smallest $\delta_j$ such that $(1 - \delta_j)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_j)\|x\|_2^2$ for all vectors $x \in \mathbb{R}^m$ with $\|x\|_0 \le j$. The lower and upper bounds of this inequality are symmetric, and hence the authors of [25] introduced asymmetric RICs, which are defined as the best $\alpha_j$ and $\beta_j$ such that $\alpha_j\|x\|_2 \le \|Ax\|_2 \le \beta_j\|x\|_2$ for all vectors $x \in \mathbb{R}^m$ with $\|x\|_0 \le j$. Comparing with (8), it is seen that the left ARIC ($\alpha_j$) is the same quantity denoted by $\sigma_{\min}^{(j)}(A)$ above.

B. The upper bound 

Now we state the main theorem of this section: 

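Theorem 2. Let $A$ be an $n \times m$ matrix ($m > n$) with unit $\ell^2$ norm columns. Suppose that $s_0$ is a solution of $As = x$ for which $\|s_0\|_0 \le \ell/2$, where $\ell$ is an arbitrary integer less than or equal to $q(A)$. Let $\hat{s}$ be a solution of $As = x$, and define $\alpha_{\hat{s},\ell} \triangleq h(\lfloor \ell/2 \rfloor + 1, \hat{s})$. Then

$$\|\hat{s} - s_0\|_2 \le \left(\frac{1}{\sigma_{\min,A}^{(\ell)}} + 1\right) m\, \alpha_{\hat{s},\ell}. \tag{9}$$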
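A minimal sketch of how the right-hand side of (9) could be evaluated for a tiny dictionary is given below. The helper names (`h`, `sigma_min_j`, `theorem2_bound`) and the toy data are assumptions for illustration only; the exhaustive search over column subsets is exactly the intractable step the paper warns about for large matrices.

```python
import itertools
import numpy as np

def h(k, s):
    """Magnitude of the k-th largest-magnitude entry of s (k is 1-based)."""
    return np.sort(np.abs(s))[::-1][k - 1]

def sigma_min_j(A, j):
    """Smallest singular value over all submatrices made of j columns of A."""
    m = A.shape[1]
    return min(np.linalg.svd(A[:, list(c)], compute_uv=False).min()
               for c in itertools.combinations(range(m), j))

def theorem2_bound(A, s_hat, ell):
    """Right-hand side of (9); assumes unit-norm columns and ell <= q(A)."""
    m = A.shape[1]
    alpha = h(ell // 2 + 1, s_hat)
    return (1.0 / sigma_min_j(A, ell) + 1.0) * m * alpha

# Hypothetical toy problem: q(A) = 2, ||s0||_0 = 1 <= ell / 2 with ell = 2.
r = 1.0 / np.sqrt(2.0)
A = np.array([[1.0, 0.0, r],
              [0.0, 1.0, r]])
s0 = np.array([1.0, 0.0, 0.0])
s_hat = s0 + 0.01 * np.array([r, r, -1.0])   # same x = A s, approximately sparse

err = np.linalg.norm(s_hat - s0)
bound = theorem2_bound(A, s_hat, ell=2)
bound_exact = theorem2_bound(A, s0, ell=2)   # exactly sparse estimate
```

When the estimate is exactly sparse, as with `s0` itself, the bound evaluates to zero, which mirrors how the theorem recovers the uniqueness result.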



Before going to the proof, let us state a few remarks on the 
consequences of the above theorem. 

Remark 1. Suppose that $As = x$ has a sparse solution $s_0$ which satisfies $\|s_0\|_0 \le \frac{1}{2} q(A)$. By setting $\ell = q(A)$ in (9), which is the largest $\ell$ satisfying the conditions of the theorem, we will have

$$\|\hat{s} - s_0\|_2 \le \left(\frac{1}{\sigma_{\min,A}^{(q)}} + 1\right) m\, \alpha_{\hat{s},q}. \tag{10}$$

If the estimated sparse solution $\hat{s}$ also satisfies $\|\hat{s}\|_0 \le \frac{q}{2}$, then $\alpha_{\hat{s},q} = 0$, hence the upper bound in (9) vanishes, and therefore $\hat{s} = s_0$. In other words, the above theorem implies that a solution with $\|\hat{s}\|_0 \le \frac{1}{2} q(A)$ is unique, that is, the above theorem implies the uniqueness theorem. For example, for the special case of $A$ satisfying the URP ($q(A) = n$), if we have found a solution satisfying $\|\hat{s}\|_0 \le \frac{n}{2}$, we are sure that we have found the unique sparsest solution.

Remark 2. Moreover, if the estimated sparse solution $\hat{s}$ is sparse only in the approximate sense, that is, if $m - \lfloor q/2 \rfloor$ components of $\hat{s}$ have very small magnitudes, then $\alpha_{\hat{s},q}$ is small, and the bound (10) states that we are probably (depending on the matrix $A$) close to the true solution. Moreover, in this case, $\sigma_{\min,A}^{(q)}$ determines some kind of sensitivity to the dictionary: For example, if the URP holds ($q = n$) but there exists an $n \times n$ square submatrix of $A$ which is ill-conditioned, then $\sigma_{\min,A}^{(n)}$ is very small and hence, for achieving a predetermined accuracy, $\alpha_{\hat{s},n}$ should be very small, that is, the sparsity of $\hat{s}$ should hold with a better approximation.

Remark 3. Theorem 2 also states some kind of 'sensitivity' to the degree of sparseness of the sparsest solution $s_0$. Let $p \triangleq \|s_0\|_0$, set $\ell = 2p$ in (9), and suppose that $\ell \le q(A)$. Then the conditions of Theorem 2 are satisfied and hence (9) becomes

$$\|\hat{s} - s_0\|_2 \le \left(\frac{1}{\sigma_{\min,A}^{(2p)}} + 1\right) m\, \alpha_{\hat{s},2p}. \tag{11}$$

In other words, whenever $s_0$ is sparser, $p$ is smaller, hence from (6) and (7) $\sigma_{\min,A}^{(2p)}$ is larger, and therefore a larger $\alpha_{\hat{s},2p}$ is tolerable (that is, we have less sensitivity to exact sparseness of $\hat{s}$). This can somehow explain the fact that sparse recovery algorithms work better for sparser $s_0$'s [19].



C. Proof 

To prove Theorem 2, we first state a modified version of Corollary 1:

Proposition 1. Let $A$ be an $n \times m$ matrix ($m > n$) with unit $\ell^2$ norm columns, and assume that any $\ell$ columns of $A$ are linearly independent ($\ell \le n$). Let $\delta \in \mathrm{null}(A)$. If for an $\alpha > 0$, $\|\delta\|_{0,\alpha} \le \ell$, then

$$\|\delta\|_2 \le \left(\frac{1}{\sigma_{\min,A}^{(\ell)}} + 1\right) m\, \alpha. \tag{12}$$

Proof: The proof is based on some modifications to the proof of Lemma 1 of [19].

Let $m_l$ be the number of components of $\delta$ with magnitudes larger than $\alpha$, and $m_s$ be the number of components of $\delta$ with magnitudes smaller than or equal to $\alpha$. In other words, $m_l = \|\delta\|_{0,\alpha}$ and $m_s = m - \|\delta\|_{0,\alpha}$ (the indices $l$ and $s$ in $m_l$ and $m_s$ stand for 'Large' and 'Small'). We consider two cases:

Case 1 ($m_l = 0$ and $m_s = m$): In this case, all the components of $\delta$ have magnitudes less than or equal to $\alpha$, and hence we can simply write $\|\delta\|_2 \le \sum_i |\delta_i| \le m\alpha$, which also satisfies (12).

Case 2 ($1 \le m_l \le \ell$ and $m_s \le m - 1$): In this case, there is at least one component of $\delta$ with magnitude larger than $\alpha$. Let $\delta_l$ be composed of the components of $\delta$ which have magnitudes larger than $\alpha$, and $A_l$ be composed of the corresponding columns of $A$. Similarly², let $\delta_s$ be composed of the components of $\delta$ which have magnitudes less than or equal to $\alpha$, and $A_s$ be composed of the corresponding columns of $A$. Since $\delta \in \mathrm{null}(A)$, we have $0 = A\delta = A_l\delta_l + A_s\delta_s$, and we define

$$b \triangleq A_l\delta_l = -A_s\delta_s. \tag{13}$$

From $b = -A_s\delta_s$,

$$\|b\|_2 = \|A_s\delta_s\|_2 = \Big\|\sum_{i:\,|\delta_i| \le \alpha} \delta_i a_i\Big\|_2 \le \sum_{i:\,|\delta_i| \le \alpha} \underbrace{|\delta_i|}_{\le \alpha}\, \underbrace{\|a_i\|_2}_{=1} \le m_s\alpha \le (m - 1)\alpha \le m\alpha. \tag{14}$$

From $b = A_l\delta_l$,

$$\|b\|_2 = \|A_l\delta_l\|_2 \ge \sigma_{\min}(A_l)\,\|\delta_l\|_2 \;\Rightarrow\; \|\delta_l\|_2 \le \frac{\|b\|_2}{\sigma_{\min}(A_l)}. \tag{15}$$

² In MATLAB notation, $A_l$ = A(:, abs(delta) > alpha) and $A_s$ = A(:, abs(delta) <= alpha).

Note that in the above equations, the assumption $m_l \le \ell$ was essential; otherwise $\|A_l\delta_l\|_2$ and $\sigma_{\min}(A_l)$ could be zero.

Combining now (14) and (15), we will have

$$\|\delta_l\|_2 \le \frac{m\alpha}{\sigma_{\min}(A_l)}. \tag{16}$$

Moreover, $\|\delta_s\|_2 \le m_s\alpha \le (m - 1)\alpha \le m\alpha$. Therefore

$$\|\delta\|_2 \le \|\delta_l\|_2 + \|\delta_s\|_2 \le \left(\frac{1}{\sigma_{\min}(A_l)} + 1\right) m\alpha. \tag{17}$$

Now, from the definition (5) and Lemma 1, $\sigma_{\min}(A_l) \ge \sigma_{\min}^{(m_l)}(A) \ge \sigma_{\min}^{(\ell)}(A)$, which proves the proposition. ■

Proof of Theorem 2: $s_0$ has at most $\lfloor \ell/2 \rfloor$ nonzero components and $\hat{s}$ has at most $\lfloor \ell/2 \rfloor$ components with magnitudes larger than $\alpha_{\hat{s},\ell}$. Therefore, $\hat{s} - s_0$ has at most $\ell$ components with magnitudes larger than $\alpha_{\hat{s},\ell}$. Moreover, $(\hat{s} - s_0) \in \mathrm{null}(A)$. Hence, the conditions of Proposition 1 hold for $\delta = \hat{s} - s_0$ and $\alpha = \alpha_{\hat{s},\ell}$, which proves the theorem. ■

IV. A TIGHT BOUND 

Although the bound in (9) is relatively simple, except for the trivial case $\alpha_{\hat{s},\ell} = 0$ the equality in (9) can never be satisfied (as will be explained in the proof of Theorem 3). Therefore, for an approximately sparse $\hat{s}$ the bound in (9) is not tight. In this section we present a tight bound on the estimation error $\|\hat{s} - s_0\|_2$, which does not depend only on minimal singular values of submatrices of $A$, but depends on a quantity $\gamma(A)$ defined below.



A. Definitions and notations 

Definition 2. Let $A$ be an $n \times m$ matrix with $m > n$. For any $B \in \mathcal{P}_j(A)$, let $B^c$ denote the matrix composed of the columns of $A$ which are not in $B$. For $j = 1, \dots, q(A)$ we define

$$\eta_j(A) \triangleq \max_{B \in \mathcal{P}_j(A)} \left[\frac{\sigma_{\max}(B^c)}{\sigma_{\min}(B)}\right]. \tag{18}$$

Note that while $j \le q(A)$, $\sigma_{\min}(B) > 0$, and hence $\eta_j(A) < \infty$ for $j = 1, \dots, q(A)$.

Definition 3. Let the matrix $A$ be as in the previous definition. For $j = 1, \dots, q(A)$ we define the quantities

$$\gamma_j(A) \triangleq \sqrt{(m - j)\left(1 + \eta_j^2(A)\right)}, \tag{19}$$

$$\bar{\gamma}_j(A) \triangleq \max\{\gamma_1(A), \gamma_2(A), \dots, \gamma_j(A)\}, \tag{20}$$

$$\bar{\gamma}'_j(A) \triangleq \max\{\sqrt{m}, \gamma_1(A), \gamma_2(A), \dots, \gamma_j(A)\}. \tag{21}$$

We also use the notations $\gamma(A)$ and $\gamma'(A)$ to denote the largest $\bar{\gamma}_j(A)$ and $\bar{\gamma}'_j(A)$ over the whole range of $j$, that is, $\gamma(A) = \bar{\gamma}_q(A) = \max\{\gamma_1(A), \dots, \gamma_q(A)\}$, and $\gamma'(A) = \max\{\sqrt{m}, \gamma(A)\}$.
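The quantities in (18)-(21) can be evaluated by exhaustive search for a very small dictionary. The following sketch does so under stated assumptions: the function names `eta_j`, `gamma_j`, `gamma_bar` and `gamma_bar_prime` and the toy matrix are hypothetical, and `j` must not exceed $q(A)$ (otherwise a division by zero occurs).

```python
import itertools
import numpy as np

def sv(B):
    """Singular values of B (there are min(p, q) of them, zeros included)."""
    return np.linalg.svd(B, compute_uv=False)

def eta_j(A, j):
    """Equation (18): max over j-column B of sigma_max(B^c) / sigma_min(B)."""
    m = A.shape[1]
    best = 0.0
    for cols in itertools.combinations(range(m), j):
        rest = [c for c in range(m) if c not in cols]
        best = max(best, sv(A[:, rest]).max() / sv(A[:, list(cols)]).min())
    return best

def gamma_j(A, j):
    """Equation (19)."""
    m = A.shape[1]
    return np.sqrt((m - j) * (1.0 + eta_j(A, j) ** 2))

def gamma_bar(A, ell):
    """Equation (20): largest gamma_j for j = 1..ell."""
    return max(gamma_j(A, j) for j in range(1, ell + 1))

def gamma_bar_prime(A, ell):
    """Equation (21): same maximum, but also including sqrt(m)."""
    return max(np.sqrt(A.shape[1]), gamma_bar(A, ell))

# Hypothetical 2 x 3 dictionary with unit-norm columns and q(A) = 2.
r = 1.0 / np.sqrt(2.0)
A = np.array([[1.0, 0.0, r],
              [0.0, 1.0, r]])
g1 = gamma_j(A, 1)
g2 = gamma_j(A, 2)
g = gamma_bar(A, 2)
gp = gamma_bar_prime(A, 2)
```

For this toy dictionary the closed forms are $\gamma_1 = \sqrt{2(2 + 1/\sqrt{2})}$ and $\gamma_2 = \sqrt{1 + 1/(1 - 1/\sqrt{2})}$, which follow from the two-column singular values $\sqrt{1 \pm \rho}$ with $\rho$ the absolute inner product of the columns.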



Remark. Note that the sequence $\eta_j(A)$, $j = 1, \dots, n$, is not necessarily increasing. In effect, by taking one column from a matrix $M$ and appending it to a matrix $N$, the ratio $\sigma_{\max}^2(M)/\sigma_{\min}^2(N)$ does not necessarily increase, because both its numerator and denominator decrease by Lemma 1. As an example, for the matrix A=[ 0.79, 0.82, -0.84, -0.82; -0.57, 0.55, 0.33, -0.38; 0.23, -0.19, -0.43, 0.42], which has normalized columns (up to 2 decimal points), the sequence $\eta_j$ is approximately equal to {1.49, 6.91, 5.04}. Similarly, the sequence $\gamma_j(A)$ is not necessarily increasing. For the above matrix, this sequence is approximately equal to {3.11, 9.87, 5.14}. However, as we will see in Section VI, for large random matrices with independently and identically distributed (iid) Gaussian entries, these sequences are both almost surely increasing (see Remark 3 after Lemma 4).

B. The upper bound 

Now we are ready to state the main theorem of this section, which provides a tight upper bound on $\|\hat{s} - s_0\|_2$:

Theorem 3. Let $A$ be an $n \times m$ matrix ($m > n$), and suppose that $s_0$ is a solution of $As = x$ for which $\|s_0\|_0 \le \ell/2$, where $\ell$ is an arbitrary integer less than or equal to $q(A)$. Let $\hat{s}$ be a solution of $As = x$, and define $\alpha_{\hat{s},\ell} \triangleq h(\lfloor \ell/2 \rfloor + 1, \hat{s})$. Then

$$\|\hat{s} - s_0\|_2 \le \bar{\gamma}'_\ell(A) \cdot \alpha_{\hat{s},\ell}. \tag{22}$$

Moreover, if the columns of $A$ are of unit $\ell^2$ norm, then

$$\|\hat{s} - s_0\|_2 \le \bar{\gamma}_\ell(A) \cdot \alpha_{\hat{s},\ell}. \tag{23}$$

Remark 1. For deterministic dictionaries, one usually uses normalized atoms, for which (23) holds. However, (22) will be useful for random dictionaries (Section VI) with independently and identically distributed (iid) entries, for which having the unit $\ell^2$ norm cannot be guaranteed. Note also that if we normalize the columns of such a random dictionary, its entries would no longer be independent.

Remark 2. In particular, by setting $\ell = q$ in (23), the bound (10) is modified to $\|\hat{s} - s_0\|_2 \le \gamma(A) \cdot \alpha_{\hat{s},q}$, and for the URP case

$$\|\hat{s} - s_0\|_2 \le \gamma(A) \cdot \alpha_{\hat{s},n}. \tag{24}$$

In other words, if we know the constant $\gamma(A)$ for our dictionary, its multiplication by the $(\lfloor n/2 \rfloor + 1)$'th largest magnitude component of the estimate $\hat{s}$ would be an upper bound on the estimation error $\|\hat{s} - s_0\|_2$. Like in (9), this ensures that if we have obtained an approximately sparse solution of $As = x$, we are probably close to the true sparsest solution. However, unlike (9), the bound in (24) is tight. We will see in Section IV-D an example for which the equality in (24) is satisfied.

C. Proof 

From the proof of Theorem 2 it can be seen that if we obtain an upper bound for $\|\delta\|_2$ where $\delta \in \mathrm{null}(A)$ satisfies $\|\delta\|_{0,\alpha} \le \ell$ for an $\alpha > 0$, we will obtain a bound for $\|\hat{s} - s_0\|_2$ for $\alpha = \alpha_{\hat{s},\ell}$. Such a bound is given in the following proposition, as a modification to Proposition 1.

Proposition 2. Let $A$ and $\ell$ be as in Theorem 3. Let $\delta \in \mathrm{null}(A)$, and suppose that for an $\alpha > 0$, $\|\delta\|_{0,\alpha} \le \ell$. Let also $m_l$ and $m_s$ denote the number of entries of $\delta$ with magnitudes larger than, and less than or equal to $\alpha$, respectively.

a) Case $m_l = 0$ and $m_s = m$ (i.e. where all entries of $\delta$ have magnitudes smaller than or equal to $\alpha$). Then

$$\|\delta\|_2 \le \sqrt{m}\,\alpha. \tag{25}$$

b) Case $1 \le m_l \le \ell$ and $m_s \le m - 1$: Let $A_l$ and $A_s$ be composed of the columns of $A$ which correspond to the entries of $\delta$ that have magnitudes larger than $\alpha$, and less than or equal to $\alpha$, respectively (see footnote 2). Then

$$\|\delta\|_2 \le \alpha \sqrt{m_s \left(1 + \frac{\sigma_{\max}^2(A_s)}{\sigma_{\min}^2(A_l)}\right)}. \tag{26}$$

Proof: The bound in (12) is not tight because several inequalities used in its proof are not tight. Indeed, the equalities of the last inequality in (14) and also the first inequality in (17) can never be met (except for the trivial case $\alpha = 0$). Hence, we prove the proposition by modifying the proof of Proposition 1. Moreover, as opposed to what had been done in (17), in this proof we do not use the assumption that the columns of $A$ are normalized.

Note first that for a vector $y = (y_1, \dots, y_p)^T$, if $\forall i: 0 \le |y_i| \le \alpha$, then

$$\|y\|_2 \le \sqrt{p}\,\alpha, \tag{27}$$

and the equality holds if and only if $\forall i: |y_i| = \alpha$.

Proof of Part a (case $m_l = 0$ and $m_s = m$): In this case, (27) directly implies (25), where the equality holds if and only if $|\delta_i| = \alpha \Rightarrow \delta_i = \pm\alpha$. Moreover, for the equality to be satisfied, $\delta$ has to be in $\mathrm{null}(A)$, that is, $A\delta = \sum_i \delta_i a_i = 0$, where the $a_i$'s are the columns of $A$. Hence, the upper bound in (25) is tight, and is achieved only for the dictionaries for which a linear combination of their columns with $+1$ or $-1$ coefficients vanishes ($\sum_i \pm a_i = 0$).

Proof of Part b (case $m_l \ge 1$, $m_s \le m - 1$): We follow the same argument as in the proof of Proposition 1, but instead of (14) we write

$$\|A_s\delta_s\|_2 \le \sigma_{\max}(A_s)\,\|\delta_s\|_2. \tag{28}$$

Combining it with (15), we will have (instead of (16))

$$\|\delta_l\|_2 \le \frac{\sigma_{\max}(A_s)}{\sigma_{\min}(A_l)}\,\|\delta_s\|_2. \tag{29}$$

Finally, instead of (17) we write $\|\delta\|_2^2 = \|\delta_l\|_2^2 + \|\delta_s\|_2^2$, and hence from the above inequality we will have

$$\|\delta\|_2^2 \le \frac{\sigma_{\max}^2(A_s)}{\sigma_{\min}^2(A_l)}\,\|\delta_s\|_2^2 + \|\delta_s\|_2^2 = \left(1 + \frac{\sigma_{\max}^2(A_s)}{\sigma_{\min}^2(A_l)}\right)\|\delta_s\|_2^2. \tag{30}$$

Finally, we use (27) again to write $\|\delta_s\|_2 \le \alpha\sqrt{m_s}$, which in combination with the above inequality gives (26). ■

To prove Theorem 3 we also need the following lemma, the proof of which is left to Appendix VIII-A.

Lemma 2. Let $A$ be an $n \times m$ ($m > n$) matrix with unit $\ell^2$ norm columns. Then $\gamma_1(A) \ge \sqrt{m}$.

Proof of Theorem 3: Note that $(\hat{s} - s_0) \in \mathrm{null}(A)$. Moreover, $s_0$ has at most $\lfloor \ell/2 \rfloor$ nonzero components and $\hat{s}$ has at most $\lfloor \ell/2 \rfloor$ components with magnitudes larger than $\alpha_{\hat{s},\ell}$. Therefore, $\hat{s} - s_0$ has at most $\ell$ components with magnitudes larger than $\alpha_{\hat{s},\ell}$, that is, it has either 0, or 1, ..., or $\ell$ components with magnitudes larger than $\alpha_{\hat{s},\ell}$. If it has $1 \le j \le \ell$ components larger than $\alpha_{\hat{s},\ell}$, from (26) we have

$$\|\hat{s} - s_0\|_2 \le \alpha_{\hat{s},\ell} \sqrt{(m - j)\left(1 + \frac{\sigma_{\max}^2(A_s)}{\sigma_{\min}^2(A_l)}\right)} \le \gamma_j(A)\,\alpha_{\hat{s},\ell}, \tag{31}$$

because $\gamma_j$ had been defined as the maximum of $\sqrt{(m - j)\left(1 + \sigma_{\max}^2(A_s)/\sigma_{\min}^2(A_l)\right)}$ for all possible choices of $A_l$ and $A_s$. On the other hand, if $\hat{s} - s_0$ has no components larger than $\alpha_{\hat{s},\ell}$, from (25) we have $\|\hat{s} - s_0\|_2 \le \alpha_{\hat{s},\ell}\sqrt{m}$. This completes the proof of (22). Then, combining it with Lemma 2 proves (23). ■



Experiment. To experimentally compare the "first bound" given in (9) and the "second bound" given in (23), we conducted a simple experiment. We first generated a random $\mathbf{A}$ of dimension $8 \times 12$ by drawing each of its entries from a $N(0,1)$ distribution, and then divided each of its columns by its norm to obtain a unit norm column matrix $\mathbf{A}$. We then generated a random sparse $\mathbf{s}_0$ by randomly choosing the positions of $p = 2$ of its entries and then assigning random magnitudes (drawn from a $N(0,1)$ distribution) to these positions and zero to the other positions. Then $\mathbf{x} = \mathbf{A}\mathbf{s}_0$ was calculated, $\mathbf{x}$ and $\mathbf{A}$ were given to the SL0 algorithm, and the parameter $\sigma_{\min}$ of SL0 was chosen relatively large (0.1) to force SL0 to create a not so accurate estimation $\hat{\mathbf{s}}$. Then the actual error $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$, the "first bound" and the "second bound" were calculated by setting $\ell = 2p$ (note that $\mathbf{A}$ is of relatively small dimensions, permitting exact calculation of $\sigma_{\min}^{(\ell)}(\mathbf{A})$ and $\gamma(\mathbf{A})$ using a combinatorial search). The whole experiment was repeated 100 times by regenerating $\mathbf{A}$ and $\mathbf{s}_0$. The average values of the ratios (First bound)/(Actual error) and (Second bound)/(Actual error) through these 100 experiments were 47.4 and 16.2, respectively. It is seen that the second bound is much tighter than the first bound. Moreover, although the ratio (Second bound)/(Actual error) is seen to be on average 16.2, this bound is in fact a tight bound, in the sense that there are instances of $\mathbf{A}$, $\mathbf{s}_0$ and $\hat{\mathbf{s}}$ such that the equality in (23) holds. Such an example is given in the next subsection, proving the tightness of this bound.
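The combinatorial search mentioned in the experiment can be sketched as follows. This is only an illustrative sketch and not the code used for the reported figures: SL0 is replaced by a hypothetical stand-in (the true solution plus a small null-space vector, so that the estimate still solves the system), the "first bound" is taken as the noiseless case $(1/\sigma_{\min}^{(\ell)}(\mathbf{A}) + 1)\,m\,\alpha_{\hat{\mathbf{s}},\ell}$, the "second bound" is taken as $\max(\sqrt{m}, \gamma(\mathbf{A}))\,\alpha_{\hat{\mathbf{s}},\ell}$, and all function names are assumptions of this sketch.

```python
import itertools
import numpy as np

def sigma_min_ell(A, ell):
    """Smallest singular value over all n x ell column submatrices of A."""
    m = A.shape[1]
    return min(
        np.linalg.svd(A[:, list(cols)], compute_uv=False)[-1]
        for cols in itertools.combinations(range(m), ell)
    )

def gamma(A, ell):
    """max over 1 <= j <= ell of gamma_j(A), found by exhaustive search."""
    m = A.shape[1]
    best = 0.0
    for j in range(1, ell + 1):
        for cols in itertools.combinations(range(m), j):
            rest = [k for k in range(m) if k not in cols]
            s_min = np.linalg.svd(A[:, list(cols)], compute_uv=False)[-1]
            s_max = np.linalg.svd(A[:, rest], compute_uv=False)[0]
            eta = s_max / s_min
            best = max(best, np.sqrt((m - j) * (1.0 + eta**2)))
    return best

def h(k, s):
    """Magnitude of the k-th largest-magnitude entry of s (k is 1-based)."""
    return np.sort(np.abs(s))[::-1][k - 1]

rng = np.random.default_rng(0)
n, m, p = 8, 12, 2
ell = 2 * p
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)            # unit-norm columns

s0 = np.zeros(m)
s0[rng.choice(m, size=p, replace=False)] = rng.standard_normal(p)

# Stand-in for an inexact solver (assumption): perturb s0 inside null(A),
# so that A @ s_hat equals A @ s0 while s_hat is only approximately sparse.
null_basis = np.linalg.svd(A)[2][n:].T    # m x (m - n) basis of null(A)
s_hat = s0 + 0.01 * null_basis @ rng.standard_normal(m - n)

alpha = h(ell // 2 + 1, s_hat)
err = np.linalg.norm(s_hat - s0)
first_bound = (1.0 / sigma_min_ell(A, ell) + 1.0) * m * alpha
second_bound = max(np.sqrt(m), gamma(A, ell)) * alpha
print(err, first_bound, second_bound)
```

The exhaustive loops visit every column subset of size up to $\ell$, which is feasible only for small dictionaries such as this $8 \times 12$ one.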

D. Example of equality in Theorem 3

To show that the bound given in Theorem 3 is tight, we present the following tricky example, which is stated in the form of a proposition.

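Proposition 3. The estimation error $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$ achieves its upper bound in (24) for

$$\mathbf{A} = \begin{bmatrix} 1 & \cos\theta & \sin\frac{\theta}{2} \\ 0 & \sin\theta & -\cos\frac{\theta}{2} \end{bmatrix}, \quad \mathbf{s}_0 = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}, \quad \hat{\mathbf{s}} = \begin{bmatrix} 0 \\ \beta \\ \alpha \end{bmatrix} \tag{32}$$

for any $0 < \theta < \cos^{-1}\left(\frac{\sqrt{17}-1}{4}\right) \approx 38.6683^\circ$, any $\alpha > 0$, and $\beta \triangleq \alpha/(2\sin\frac{\theta}{2})$.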

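For example, for $\theta = 5^\circ$ and $\alpha = 0.2$, we have the following example (up to 4 digits) which achieves the upper bound of (24):

$$\mathbf{A} = \begin{bmatrix} 1 & 0.9962 & 0.0436 \\ 0 & 0.0872 & -0.9990 \end{bmatrix}, \quad \mathbf{s}_0 = \begin{bmatrix} 2.2926 \\ 0 \\ 0 \end{bmatrix}, \quad \hat{\mathbf{s}} = \begin{bmatrix} 0 \\ 2.2926 \\ 0.2 \end{bmatrix}.$$

Fig. 2. The columns of the matrix $\mathbf{A}$ in (32) are shown as the vectors $\mathbf{a}_1$, $\mathbf{a}_2$ and $\mathbf{a}_3$.

Note that $\hat{\mathbf{s}}$ in this example is an approximately sparse solution of $\mathbf{A}\mathbf{s} = \mathbf{x}$, where $\mathbf{x} = \mathbf{A}\mathbf{s}_0$, but it is completely different from $\mathbf{s}_0$.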

For the proof, we need the following lemma:

Lemma 3. a) Let $\mathbf{B}$ be a single-column matrix (i.e. a column vector), and let this column have unit $\ell^2$ norm. Then the sole singular value of $\mathbf{B}$ is equal to 1.

b) Let $\mathbf{B}$ be a two-column matrix (with more than one row), the columns of which, $\mathbf{b}_1$ and $\mathbf{b}_2$, have unit $\ell^2$ norm. Then the two singular values of $\mathbf{B}$ satisfy $\sigma_{\min}^2(\mathbf{B}) = 1 - \rho$ and $\sigma_{\max}^2(\mathbf{B}) = 1 + \rho$, where $\rho \triangleq |\mathbf{b}_1^T \mathbf{b}_2| = |\cos\varphi|$, in which $\varphi$ is the angle between $\mathbf{b}_1$ and $\mathbf{b}_2$. Note that a smaller angle $\varphi$ results both in a smaller $\sigma_{\min}$ and a larger $\sigma_{\max}$.

Proof: The result is simply obtained by direct calculation of the eigenvalues of $\mathbf{B}^T\mathbf{B}$. ■

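Proof of Proposition 3: Step 1) Note that $\mathbf{A}$ satisfies the URP. By defining $\mathbf{x} \triangleq \mathbf{A}\mathbf{s}_0$, it can be easily verified that both $\mathbf{s}_0$ and $\hat{\mathbf{s}}$ are solutions of $\mathbf{A}\mathbf{s} = \mathbf{x}$. Moreover, for $0 < \theta < 60^\circ$, we have $\beta > \alpha$ and hence $\alpha_{\hat{\mathbf{s}},\ell} = \alpha$.

Step 2) Calculating $\gamma_1(\mathbf{A})$: From definition (18), for $j = 1$, $\mathbf{B}$ has only one column and hence $\sigma_{\min}(\mathbf{B}) = 1$ from Lemma 3. Let $\mathbf{a}_1$, $\mathbf{a}_2$ and $\mathbf{a}_3$ denote the columns of $\mathbf{A}$, respectively. These vectors are drawn in Fig. 2. To maximize $\sigma_{\max}(\mathbf{B}^c)$ among the 3 possible choices for $\mathbf{B}^c$, using Lemma 3 we have to find the two vectors for which the absolute value of the cosine of their angle is maximum. Simple manipulation of Fig. 2 shows that for $0 < \theta < 60^\circ$ the maximum of $\sigma_{\max}(\mathbf{B}^c)$ is obtained for $\mathbf{B}^c = [\mathbf{a}_1, \mathbf{a}_2]$. Consequently, $\eta_1^2(\mathbf{A}) = 1 + \cos\theta \Rightarrow \gamma_1^2(\mathbf{A}) = 2(2 + \cos\theta)$.

Step 3) Calculating $\gamma_2(\mathbf{A})$: From definition (18), for $j = 2$, $\mathbf{B}^c$ has only one column and hence $\sigma_{\max}(\mathbf{B}^c) = 1$ from Lemma 3. Among the 3 possible choices for $\mathbf{B}$, using Lemma 3 it can be seen that for $0 < \theta < 60^\circ$ the minimum of $\sigma_{\min}(\mathbf{B})$ is obtained for $\mathbf{B} = [\mathbf{a}_1, \mathbf{a}_2]$. Consequently, $\eta_2^2(\mathbf{A}) = 1/(1 - \cos\theta) \Rightarrow \gamma_2^2(\mathbf{A}) = 1 + 1/(1 - \cos\theta)$.

Step 4) Comparing $\gamma_1(\mathbf{A})$ and $\gamma_2(\mathbf{A})$: Simple algebra shows that

$$\gamma_1^2(\mathbf{A}) \le \gamma_2^2(\mathbf{A}) \Leftrightarrow 2\cos^2\theta + \cos\theta - 2 \ge 0 \Leftrightarrow \cos\theta \ge \frac{\sqrt{17}-1}{4} \Leftrightarrow |\theta| \le \theta_0,$$

where $\theta_0 \triangleq \cos^{-1}\left(\frac{\sqrt{17}-1}{4}\right) \approx 38.6683^\circ$. Hence, since by assumption $0 < \theta < \theta_0$, we will have

$$\gamma(\mathbf{A}) = \gamma_2(\mathbf{A}) = \sqrt{1 + \frac{1}{1 - \cos\theta}} = \sqrt{1 + \frac{1}{2\sin^2\frac{\theta}{2}}}. \tag{33}$$

Step 5) Now, for $\beta = \alpha/(2\sin\frac{\theta}{2})$ we write

$$\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2^2 = 2\beta^2 + \alpha^2 = \alpha^2\left(1 + \frac{1}{2\sin^2\frac{\theta}{2}}\right) = \gamma^2(\mathbf{A})\,\alpha^2,$$

which completes the proof. ■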
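The equality case can also be checked numerically. The short sketch below, with variable names chosen only for illustration, builds the matrix and vectors of (32) for $\theta = 5^\circ$ and $\alpha = 0.2$, confirms that both vectors produce the same $\mathbf{x}$, and compares the error with $\gamma(\mathbf{A})\alpha$ computed from (33).

```python
import math

theta = math.radians(5.0)
alpha = 0.2
beta = alpha / (2.0 * math.sin(theta / 2.0))

# Rows of the 2 x 3 matrix A in (32); its columns are a1, a2, a3.
A = [
    [1.0, math.cos(theta), math.sin(theta / 2.0)],
    [0.0, math.sin(theta), -math.cos(theta / 2.0)],
]
s0 = [beta, 0.0, 0.0]
s_hat = [0.0, beta, alpha]

def matvec(M, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

x = matvec(A, s0)
x_hat = matvec(A, s_hat)

# gamma(A) from (33), valid for 0 < theta < theta_0.
gamma = math.sqrt(1.0 + 1.0 / (2.0 * math.sin(theta / 2.0) ** 2))
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(s_hat, s0)))
theta_0 = math.degrees(math.acos((math.sqrt(17.0) - 1.0) / 4.0))

print(beta, theta_0)
print(x, x_hat)
print(error, gamma * alpha)
```

Up to floating-point rounding, the two measurement vectors coincide and the error equals $\gamma(\mathbf{A})\alpha$, as Step 5 of the proof states.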



V. The noisy case 
Instead of the noiseless system, consider now the noisy case

$$\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n}, \tag{34}$$

where $\mathbf{n}$ denotes the noise vector, and $\|\mathbf{n}\|_2 \le \epsilon$ with $\epsilon > 0$. In SCA applications, $\mathbf{n}$ denotes the measurement noise in sensors, and in sparse signal decomposition applications, (34) is for modeling approximate signal decomposition, in which $\epsilon$ is the acceptable tolerance of the decomposition.

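In the presence of noise, the minimum $\ell^0$ norm solution of $\mathbf{A}\mathbf{s} = \mathbf{x}$ is not stable, in the sense that if $\mathbf{x} = \mathbf{A}\mathbf{s}_0 + \mathbf{n}$ (in which $\mathbf{s}_0$ is sparse), then the minimum $\ell^0$ norm solution of $\mathbf{A}\mathbf{s} = \mathbf{x}$ may be completely different from $\mathbf{s}_0$, even for a very small amount of noise [9], [32]. Hence, instead of finding the sparsest solution of $\mathbf{x} = \mathbf{A}\mathbf{s}$, it is proposed to estimate $\mathbf{s}_0$ as

$$\hat{\mathbf{s}} = \operatorname{argmin}_{\mathbf{s}} \|\mathbf{s}\|_0 \quad \text{s.t.} \quad \|\mathbf{A}\mathbf{s} - \mathbf{x}\|_2 \le \delta, \tag{35}$$

for a $\delta \ge \epsilon > 0$. This approach ensures stability in the sense that $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$ grows at worst proportionally to the noise level [9], [33] (other variants of the above expression have also been studied in the literature, for example replacing the $\ell^0$ norm with the $\ell^1$ norm, or replacing the $\ell^2$ norm in $\|\mathbf{A}\mathbf{s} - \mathbf{x}\|_2$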
by i 1 and £°° norms 0, 03, (35), Eg), 11371 ). 

In this section, we study the generalization of the main question of the previous sections to this noisy case: If we have an estimation $\hat{\mathbf{s}}$ satisfying $\|\mathbf{A}\hat{\mathbf{s}} - \mathbf{x}\|_2 \le \delta$, and if $\hat{\mathbf{s}}$ is sparse only in the approximate sense,³ is it possible (without knowing $\mathbf{s}_0$) to construct an upper bound on the error $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$? Indeed, we will first generalize the looser bound (9). Then, it will be seen that the tight bound (23) would not be easy to generalize as a closed form formula.



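A. Generalizing the looser bound (9)

1) The theorem: The generalization of the looser bound to the noisy case is given by the following theorem:

Theorem 4. Let $\mathbf{A}$ be an $n \times m$ matrix ($m > n$) with unit $\ell^2$ norm columns. Let $\mathbf{x} = \mathbf{A}\mathbf{s}_0 + \mathbf{n}$, where $\|\mathbf{n}\|_2 \le \epsilon$ and $\|\mathbf{s}_0\|_0 \le \ell/2$, in which $\ell$ is an arbitrary integer less than or equal to $q(\mathbf{A})$. Let $\hat{\mathbf{s}}$ be an estimation of $\mathbf{s}_0$ satisfying $\|\mathbf{A}\hat{\mathbf{s}} - \mathbf{x}\|_2 \le \delta$, and define $\alpha_{\hat{\mathbf{s}},\ell} \triangleq h(\lfloor \ell/2 \rfloor + 1, \hat{\mathbf{s}})$. Then

$$\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 \le \left(\frac{1}{\sigma_{\min}^{(\ell)}(\mathbf{A})} + 1\right) m\,\alpha_{\hat{\mathbf{s}},\ell} + \frac{\Delta}{\sigma_{\min}^{(\ell)}(\mathbf{A})}, \tag{36}$$

where $\Delta \triangleq \epsilon + \delta$.

Remark 1. For $\Delta = 0$ (which corresponds to the noiseless case), the above bound reduces to the looser bound for the noiseless case, i.e. (36) reduces to (9).

³ Such a solution may be obtained for example using Robust-SL0 [38].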

Remark 2. It is also interesting to consider the case $\alpha_{\hat{\mathbf{s}},\ell} = 0$. It corresponds to the case where $\hat{\mathbf{s}}$ is sparse in the exact sense, e.g. where $\hat{\mathbf{s}}$ is an exact solution of (35) for a $\delta \ge \epsilon$. In this case, the bound (36) becomes

$$\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 \le \frac{\epsilon + \delta}{\sigma_{\min}^{(\ell)}(\mathbf{A})}. \tag{37}$$

The above inequality holds for all values of $\ell$ satisfying the conditions of the theorem, and hence also for $\ell = q(\mathbf{A})$ (which is the largest possible $\ell$). This proves that the problem (35) is stable for all $\|\mathbf{s}_0\|_0 < \frac{1}{2}\mathrm{spark}(\mathbf{A})$, i.e. for the whole range of uniqueness of the sparse solution, while in [9], this stability had been proved only for the much more limited range $\|\mathbf{s}_0\|_0 < \frac{1 + M^{-1}}{2}$, where $M$ is the mutual coherence of $\mathbf{A}$. We have discussed this generalization of the stability and the inequality (37) in the correspondence [33].⁴

2) Proof: To prove the above theorem, we need to first generalize Proposition 1 to the noisy case. This is given in
the following Proposition, which is based on a modification 
of OS Lemma 4]: 

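Proposition 4. Let $\mathbf{A}$ be an $n \times m$ matrix ($m > n$) with unit $\ell^2$ norm columns, and assume that every $\ell$ columns of $\mathbf{A}$ are linearly independent ($\ell \le n$). Let $\boldsymbol{\delta}$ be a vector satisfying $\|\mathbf{A}\boldsymbol{\delta}\|_2 \le \Delta$ for a constant $\Delta \ge 0$. If for an $\alpha \ge 0$, $\|\boldsymbol{\delta}\|_{0,\alpha} \le \ell$, then

$$\|\boldsymbol{\delta}\|_2 \le \left(\frac{1}{\sigma_{\min}^{(\ell)}(\mathbf{A})} + 1\right) m\,\alpha + \frac{\Delta}{\sigma_{\min}^{(\ell)}(\mathbf{A})}. \tag{38}$$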

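Proof: The proof is based on modifications in the proof of Proposition 1. Let $\boldsymbol{\delta}_l$, $\boldsymbol{\delta}_s$ be as defined in that proof.

Case 1 ($m_l = 0$ and $m_s = m$): Exactly like the proof of Proposition 1, $\|\boldsymbol{\delta}\|_2 \le m\alpha$, which satisfies also (38).

Case 2 ($1 \le m_l \le \ell$ and $m_s \le m - 1$): Let $\mathbf{A}_l$ and $\mathbf{A}_s$ be as defined in the proof of Proposition 1. We define again $\mathbf{b} \triangleq \mathbf{A}_l\boldsymbol{\delta}_l$, but $\mathbf{b}$ is no longer equal to $-\mathbf{A}_s\boldsymbol{\delta}_s$. Similar to (14) we have $\|\mathbf{A}_s\boldsymbol{\delta}_s\|_2 \le m\alpha$. For upper bounding $\|\mathbf{b}\|_2$, instead of (14), we use the general inequality $\|\mathbf{y}_1\|_2 - \|\mathbf{y}_2\|_2 \le \|\mathbf{y}_1 - \mathbf{y}_2\|_2$ (applied with $\mathbf{y}_1 = \mathbf{A}_l\boldsymbol{\delta}_l$ and $\mathbf{y}_2 = -\mathbf{A}_s\boldsymbol{\delta}_s$) to write

$$\|\mathbf{A}_l\boldsymbol{\delta}_l\|_2 - \|\mathbf{A}_s\boldsymbol{\delta}_s\|_2 \le \|\mathbf{A}_l\boldsymbol{\delta}_l + \mathbf{A}_s\boldsymbol{\delta}_s\|_2 \le \Delta \;\Rightarrow\; \|\mathbf{b}\|_2 \le \|\mathbf{A}_s\boldsymbol{\delta}_s\|_2 + \Delta \le m\alpha + \Delta. \tag{39}$$

Putting this in (15), we have (instead of (16))

$$\|\boldsymbol{\delta}_l\|_2 \le \frac{m\alpha + \Delta}{\sigma_{\min}(\mathbf{A}_l)}, \tag{40}$$

and hence (17) becomes

$$\|\boldsymbol{\delta}\|_2 \le \|\boldsymbol{\delta}_l\|_2 + \|\boldsymbol{\delta}_s\|_2 \le \frac{m\alpha + \Delta}{\sigma_{\min}(\mathbf{A}_l)} + m\alpha, \tag{41}$$

which in combination with $\sigma_{\min}(\mathbf{A}_l) \ge \sigma_{\min}^{(\ell)}(\mathbf{A})$ completes the proof. ■

⁴ In fact, in that correspondence, we have even presented a more general result: we have shown that even minimizing $\|\mathbf{s}\|_0$ in (35) is not necessary, that is, the stability results only from $\|\mathbf{A}\hat{\mathbf{s}} - \mathbf{x}\|_2 \le \delta$ for the whole uniqueness range $\|\mathbf{s}_0\|_0 < \frac{1}{2}\mathrm{spark}(\mathbf{A})$, provided that the estimation $\hat{\mathbf{s}}$ satisfies also $\|\hat{\mathbf{s}}\|_0 < \frac{1}{2}\mathrm{spark}(\mathbf{A})$.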

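Proof of Theorem 4: Similar to the proof of Theorem 2, $\hat{\mathbf{s}} - \mathbf{s}_0$ has at most $\ell$ components with magnitudes larger than $\alpha_{\hat{\mathbf{s}},\ell}$. However, here $\hat{\mathbf{s}} - \mathbf{s}_0$ is not necessarily in $\mathrm{null}(\mathbf{A})$. Instead, we use the general inequality $\|\mathbf{y}_1\|_2 - \|\mathbf{y}_2\|_2 \le \|\mathbf{y}_1 - \mathbf{y}_2\|_2$ to write

$$\|\mathbf{A}(\hat{\mathbf{s}} - \mathbf{s}_0)\|_2 - \|\mathbf{n}\|_2 \le \|\mathbf{A}\hat{\mathbf{s}} - \mathbf{A}\mathbf{s}_0 - \mathbf{n}\|_2 = \|\mathbf{A}\hat{\mathbf{s}} - \mathbf{x}\|_2 \le \delta \;\Rightarrow\; \|\mathbf{A}(\hat{\mathbf{s}} - \mathbf{s}_0)\|_2 \le \|\mathbf{n}\|_2 + \delta \le \epsilon + \delta. \tag{42}$$

Therefore, $\boldsymbol{\delta} \triangleq \hat{\mathbf{s}} - \mathbf{s}_0$ satisfies the conditions of Proposition 4 for $\Delta = \epsilon + \delta$, which completes the proof. ■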
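As a small illustration of how the noisy looser bound (36) behaves, the following sketch evaluates its right-hand side and shows the two special cases discussed in Remarks 1 and 2. The function name and the numerical values are hypothetical and serve only to make the formula concrete.

```python
def noisy_looser_bound(sigma_min_ell, m, alpha, eps, delta):
    """Right-hand side of (36): (1/sigma + 1) * m * alpha + (eps + delta) / sigma."""
    if sigma_min_ell <= 0:
        raise ValueError("sigma_min_ell must be positive")
    big_delta = eps + delta
    return (1.0 / sigma_min_ell + 1.0) * m * alpha + big_delta / sigma_min_ell

# Hypothetical values, chosen only for illustration.
sigma, m = 0.25, 12
noiseless = noisy_looser_bound(sigma, m, alpha=0.01, eps=0.0, delta=0.0)      # Remark 1
exactly_sparse = noisy_looser_bound(sigma, m, alpha=0.0, eps=0.05, delta=0.05)  # Remark 2, (37)
print(noiseless, exactly_sparse)
```

With $\alpha_{\hat{\mathbf{s}},\ell} = 0$ only the term $\Delta/\sigma_{\min}^{(\ell)}(\mathbf{A})$ survives, which is the stability statement (37).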



B. Generalizing the tight bound (23)

In this section, we will see that obtaining a 'closed form' expression as a generalization of the tight bound (23) to the noisy case would not be easy. In fact, it is seen that the generalization of the looser bound was based on generalizing Proposition 1 to Proposition 4. Similarly, we can also generalize Proposition 2 to the noisy case:

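Proposition 5. Let $\mathbf{A}$, $\ell$, $\boldsymbol{\delta}$, $\alpha$ and $\Delta$ be as in Proposition 4. Let $m_l$ and $m_s$ denote the number of entries of $\boldsymbol{\delta}$ with magnitudes larger than, and less than or equal to $\alpha$, respectively.

a) Case $m_l = 0$ and $m_s = m$: We have

$$\|\boldsymbol{\delta}\|_2 \le \sqrt{m}\,\alpha. \tag{43}$$

b) Case $1 \le m_l \le \ell$ and $m_s \le m - 1$: Let $\mathbf{A}_l$ and $\mathbf{A}_s$ be composed of the columns of $\mathbf{A}$ which correspond to the entries of $\boldsymbol{\delta}$ that have magnitudes larger than $\alpha$, and less than or equal to $\alpha$, respectively. Then

$$\|\boldsymbol{\delta}\|_2^2 \le \left(1 + \frac{\sigma_{\max}^2(\mathbf{A}_s)}{\sigma_{\min}^2(\mathbf{A}_l)}\right) m_s\alpha^2 + 2\,\frac{\sigma_{\max}(\mathbf{A}_s)}{\sigma_{\min}^2(\mathbf{A}_l)}\sqrt{m_s}\,\alpha\Delta + \frac{\Delta^2}{\sigma_{\min}^2(\mathbf{A}_l)}. \tag{44}$$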

Proof: Part (a): The proof is the same as the proof of part (a) of Proposition 2.

Part (b): Let again $\mathbf{b} \triangleq \mathbf{A}_l\boldsymbol{\delta}_l$. With similar reasoning as in (39) we have $\|\mathbf{b}\|_2 \le \|\mathbf{A}_s\boldsymbol{\delta}_s\|_2 + \Delta$. Moreover, $\|\mathbf{A}_s\boldsymbol{\delta}_s\|_2 \le \sigma_{\max}(\mathbf{A}_s)\|\boldsymbol{\delta}_s\|_2$, and hence

$$\|\mathbf{b}\|_2 \le \sigma_{\max}(\mathbf{A}_s)\|\boldsymbol{\delta}_s\|_2 + \Delta. \tag{45}$$

Combining it with (15), we will have

$$\|\boldsymbol{\delta}_l\|_2 \le \frac{\sigma_{\max}(\mathbf{A}_s)\|\boldsymbol{\delta}_s\|_2 + \Delta}{\sigma_{\min}(\mathbf{A}_l)}. \tag{46}$$

Moreover, $\|\boldsymbol{\delta}_s\|_2 \le \sqrt{m_s}\,\alpha$ (from Eq. (27)). Combining this, the above inequality, and $\|\boldsymbol{\delta}\|_2^2 = \|\boldsymbol{\delta}_s\|_2^2 + \|\boldsymbol{\delta}_l\|_2^2$ proves the proposition. ■

However, although Proposition 2 was generalized to Proposition 5, it would be tricky to use it to obtain a closed form generalization of (23) to the noisy case. To see the difference, recall the argument of using (26) to obtain (23) (which is the same argument as using (12) to obtain (9)): under the conditions of Theorem 3, $\boldsymbol{\delta} = \hat{\mathbf{s}} - \mathbf{s}_0$ satisfies the conditions of Proposition 2 for $\alpha = \alpha_{\hat{\mathbf{s}},\ell}$. However, since we don't know $\mathbf{s}_0$, we don't know which components of $\boldsymbol{\delta}$ are smaller and which ones are larger than $\alpha$, and hence, we don't know $\mathbf{A}_s$ and $\mathbf{A}_l$. Therefore, we consider the worst case of the bound given by (26), that is, we maximize the right hand side of (26) over all possible partitionings of $\mathbf{A}$ into $\mathbf{A}_l$ and $\mathbf{A}_s$ (where the number of columns of $\mathbf{A}_l$ is at most equal to $\ell$). The point is that the right hand side of (26) was in a form such that its maximization with respect to all possible partitionings of $\mathbf{A}$ was independent of $\alpha$, and gave us the bound (23), in which we had a constant $\gamma(\mathbf{A})$ which is independent of $\alpha$, and depends only on the dictionary.

However, with the same reasoning, to obtain an upper bound on $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$ under the conditions of Theorem 4, we have to maximize the right hand side of (44) with respect to all possible partitionings of $\mathbf{A}$ into $\mathbf{A}_l$ and $\mathbf{A}_s$ (where the number of columns of $\mathbf{A}_l$ is at most equal to $\ell$). However, here, this maximization depends also on $\alpha$ and $\Delta$, because $\sigma_{\max}(\mathbf{A}_s)/\sigma_{\min}(\mathbf{A}_l)$ and $1/\sigma_{\min}(\mathbf{A}_l)$ are not necessarily maximized⁵ for the same partitioning of $\mathbf{A}$. Consequently, we can probably say nothing better than:

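Theorem 5. Let all parameters be as defined in Theorem 4. Let $\mathbf{B} \in \mathcal{P}_j(\mathbf{A})$ and let $\mathbf{B}^c$ denote the matrix composed of the columns of $\mathbf{A}$ that are not in $\mathbf{B}$. Define

$$f(\mathbf{A}, \alpha, \Delta) \triangleq \max_{1 \le j \le \ell}\; \max_{\mathbf{B} \in \mathcal{P}_j(\mathbf{A})} \left\{ \left(1 + \frac{\sigma_{\max}^2(\mathbf{B}^c)}{\sigma_{\min}^2(\mathbf{B})}\right) m_s\alpha^2 + 2\,\frac{\sigma_{\max}(\mathbf{B}^c)}{\sigma_{\min}^2(\mathbf{B})}\sqrt{m_s}\,\alpha\Delta + \frac{\Delta^2}{\sigma_{\min}^2(\mathbf{B})} \right\}, \tag{47}$$

where $m_s = m - j$ is the number of columns of $\mathbf{B}^c$. Then

$$\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2^2 \le \max\left\{ m\,\alpha_{\hat{\mathbf{s}},\ell}^2,\; f(\mathbf{A}, \alpha_{\hat{\mathbf{s}},\ell}, \Delta) \right\}. \tag{48}$$

Moreover, if $\mathbf{A}$ has unit $\ell^2$ norm columns, then

$$\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2^2 \le f(\mathbf{A}, \alpha_{\hat{\mathbf{s}},\ell}, \Delta). \tag{49}$$

Remark. In Theorems 2 to 4 the quantities $\gamma(\mathbf{A})$ or $\sigma_{\min}^{(\ell)}(\mathbf{A})$ are calculated only once for each dictionary, and then they are used with the $\alpha_{\hat{\mathbf{s}},\ell}$ (and possibly $\Delta$) of a specific problem. However, in the above theorem, the dictionary, $\alpha$ and $\Delta$ interact, and hence the upper bound should be calculated for each specific problem separately, and since this calculation is NP-hard, its usage in practical problems is probably limited.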

VI. Random Dictionaries 

Theorems 2 and 3 suggest that $\sigma_{\min}^{(\ell)}(\mathbf{A})$ and/or $\gamma(\mathbf{A})$ are important parameters of a dictionary. However, estimating these parameters for a deterministic matrix seems to be NP-hard (this has already been proven for estimating $\sigma_{\min}^{(\ell)}(\mathbf{A})$ [39]). In effect, calculating $\sigma_{\min}^{(\ell)}(\mathbf{A})$ requires examination of all $n \times \ell$ submatrices of $\mathbf{A}$, and calculation of $\gamma(\mathbf{A})$ requires examination of all $n \times j$ submatrices of $\mathbf{A}$ for $j = 1, \ldots, \ell$. These tasks are combinatorial and intractable (although they have to be done only once for a given dictionary). Moreover, even finding a computationally tractable lower bound for $\sigma_{\min}^{(\ell)}(\mathbf{A})$ or an upper bound for $\gamma(\mathbf{A})$ would provide us with a computable upper bound for the error.

⁵ Using a small MATLAB code, it is easy to find examples of $\mathbf{A}$ for which these two quantities are maximized for different partitionings.

On the contrary, for a random $\mathbf{A}$ with independent and identically distributed (iid) entries, we no longer need to examine all of its $\binom{m}{\ell}$ submatrices to obtain probabilistic upper bounds, because all $n \times \ell$ submatrices are statistically identical. On the other hand, singular values of random matrices have extensively been studied in the literature [40], [41]. Indeed, it is well-known that the singular values of random matrices are not "so random" and are highly concentrated around some deterministic values [42, Th. 2.7]. Random dictionaries are also practically important, e.g. they are frequently used in compressed sensing [3]. Note also that random matrices satisfy the URP with probability 1.

In this section, we consider random dictionaries, and state some probabilistic upper bounds for the estimation error $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$ without knowing $\mathbf{s}_0$, and independent of the method used for estimating $\hat{\mathbf{s}}$.



A. Review of some results from random matrix theory 

Let $\mathbf{X}$ be an $n \times p$ random matrix with independent and identically distributed (iid) entries with zero mean and variance $\frac{1}{n}$ (hence the expected values of the squared $\ell^2$ norms of its columns are equal to 1). A famous result by Marchenko and Pastur [41, Th. 2.35] states that if the entries of $\mathbf{X}$ come from any distribution with fourth order moment of order $O(\frac{1}{n^2})$, as $n, p \to \infty$ and $\frac{p}{n} \to c > 0$, the empirical distribution of singular values of $\mathbf{X}$ converges almost surely (a.s.) to a distribution bounded between $|1 - \sqrt{c}|$ and $1 + \sqrt{c}$. Moreover, if the entries come from any distribution with finite fourth order moment, $\sigma_{\max}(\mathbf{X})$ has been shown [43] to converge a.s. to $1 + \sqrt{c}$ (this result was first stated by Geman [44] under some more restrictive conditions). Similarly, if $0 < c < 1$ (i.e. for tall matrices), it has been shown (firstly in P31 for the Gaussian case and then in [46] for any distribution with finite fourth moment) that $\sigma_{\min}(\mathbf{X}) \xrightarrow{a.s.} 1 - \sqrt{c}$. As it is said in [46], it is obvious that a similar result holds also for wide matrices, that is, where $c > 1$. To see this, let $c > 1$. Then $\mathbf{X}' \triangleq \sqrt{n/p}\,\mathbf{X}^T$ is a tall $p \times n$ matrix with iid zero-mean elements with variance $\frac{1}{p}$, and hence, $\sqrt{1/c}\;\sigma_{\min}(\mathbf{X}) \xrightarrow{a.s.} 1 - \sqrt{1/c}$, and consequently $\sigma_{\min}(\mathbf{X}) \xrightarrow{a.s.} \sqrt{c} - 1$. Hence, generally, if $c \ne 1$, then $\sigma_{\min}(\mathbf{X}) \xrightarrow{a.s.} |1 - \sqrt{c}|$. The case $c = 1$ is, however, more complicated. For example, for Gaussian square random matrices ($p = n$), if $n \to \infty$, then the probability density function (PDF) of the random variable $\sigma_{\min}(\mathbf{X})$ converges to a simple known function [40, Th. 5.1], [47]. In other words, for $c \ne 1$, as $n \to \infty$, the PDF of $\sigma_{\min}(\mathbf{X})$ converges to a Dirac delta function (i.e. $\sigma_{\min}(\mathbf{X})$ converges a.s. to a deterministic value), but this is not true for $c = 1$.

Moreover, if $n \ge p$ (tall matrix), and the entries of $\mathbf{X}$ are drawn from a $N(0, 1/n)$ distribution, then a result due to Davidson and Szarek EH Th. 2.13], EU Eqs. (4.35) and (4.36)] states that for any $r > 0$

$$P\left\{\sigma_{\max}(\mathbf{X}) > 1 + \sqrt{\tfrac{p}{n}} + r\right\} \le e^{-nr^2/2}, \tag{50}$$

$$P\left\{\sigma_{\min}(\mathbf{X}) < 1 - \sqrt{\tfrac{p}{n}} - r\right\} \le e^{-nr^2/2}. \tag{51}$$

Note that the second inequality is mostly useful for $0 < r < 1 - \sqrt{p/n}$ (otherwise it is trivial).

It is not difficult to see that similar inequalities hold also for wide matrices. To show it, let $\mathbf{X}$ be an $n \times p$ random matrix with $n < p$ (wide matrix) and with elements drawn from a $N(0, 1/n)$ distribution (again normalized columns in expected value). Then, $\mathbf{Y} \triangleq \sqrt{n/p}\,\mathbf{X}^T$ is a $p \times n$ tall matrix with $N(0, 1/p)$ entries, for which we can write the above inequalities. For example, by writing (50) for $\mathbf{Y}$ we have

$$P\left\{\sqrt{\tfrac{n}{p}}\,\sigma_{\max}(\mathbf{X}) > 1 + \sqrt{\tfrac{n}{p}} + r\right\} \le e^{-pr^2/2}.$$

Hence, by defining $r' \triangleq r\sqrt{p/n}$, we have

$$P\left\{\sigma_{\max}(\mathbf{X}) > \sqrt{\tfrac{p}{n}} + 1 + r'\right\} \le e^{-p r'^2 \frac{n}{p}/2} = e^{-n r'^2/2}.$$

Consequently, (50) also holds for wide matrices. Similarly, it can be seen that for wide matrices ($p > n$), the inequality (51) becomes

$$P\left\{\sigma_{\min}(\mathbf{X}) < \sqrt{\tfrac{p}{n}} - 1 - r\right\} \le e^{-nr^2/2}. \tag{52}$$
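The tail bounds (50) and (51) can be compared with a small simulation. The sketch below is only illustrative: the sizes, the value of $r$, the number of trials and the helper name are arbitrary choices of this sketch, and the empirical frequencies are rough estimates of the probabilities on the left-hand sides.

```python
import numpy as np

def tail_frequencies(n, p, r, trials=200, seed=0):
    """Empirical frequencies of the events in (50) and (51) for a tall n x p matrix."""
    rng = np.random.default_rng(seed)
    c = np.sqrt(p / n)
    over = 0
    under = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p)) / np.sqrt(n)   # entries ~ N(0, 1/n)
        s = np.linalg.svd(X, compute_uv=False)          # sorted in decreasing order
        over += int(s[0] > 1.0 + c + r)
        under += int(s[-1] < 1.0 - c - r)
    return over / trials, under / trials

n, p, r = 100, 20, 0.15
bound = float(np.exp(-n * r**2 / 2.0))
freq_max, freq_min = tail_frequencies(n, p, r)
print(freq_max, freq_min, bound)
```

In runs of this kind the observed frequencies are typically far below the bound, which reflects that (50) and (51) are conservative for moderate sizes.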



B. Definitions and notations 

Let the dictionary $\mathbf{A}$ be a random $n \times m$ matrix with $m > n$ and with iid entries drawn from a $N(0, 1/n)$ distribution. In this section, we use Theorem 3 and the Davidson and Szarek inequalities to obtain a probabilistic upper bound for the error $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$.

Remark. Note that the $\ell^2$ norms of the columns of such an $\mathbf{A}$ are not necessarily equal to one (although their expected values are equal to one). Hence, we cannot use the bound given in Theorem 2 for this $\mathbf{A}$, because that theorem requires that the columns of $\mathbf{A}$ have unit $\ell^2$ norms. Moreover, if we normalize the columns of $\mathbf{A}$ by dividing them by their $\ell^2$ norms, the new entries would no longer be independent, and consequently, we cannot use the Davidson and Szarek inequalities and many other results in random matrix theory which require independent entries. Hence, it is not straightforward⁶ to obtain a probabilistic bound based on Theorem 2 (this is a mistake that we had made in the conference paper [28]). However, Theorem 3 does not require unit $\ell^2$ norm columns, and we can use it to obtain a probabilistic upper bound on $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$.

⁶ In a personal email communication with the first author, Pr. Szarek has generalized (51) to the case where the elements of $\mathbf{A}$ are first drawn independently from a zero mean Gaussian distribution and then each column is divided by its $\ell^2$ norm to obtain a unit $\ell^2$ norm column dictionary. The final bound is looser than (51) and is in a more complicated form. Hence obtaining a probabilistic bound based on Theorem 2 is indeed possible, but is not straightforward. We don't consider it in this paper because the error bound is both looser and more complicated.

Note that the bound in Theorem 3 is based on the quantities $\eta_j(\mathbf{A})$ and $\gamma_j(\mathbf{A})$ defined in (18) and (19), respectively. Hence to obtain a probabilistic bound on the error $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$, we obtain probabilistic upper bounds for these quantities.

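For the random dictionary $\mathbf{A}$ defined above, for any division of $\mathbf{A}$ into $\mathbf{B}$ and $\mathbf{B}^c$ as stated in the definition of $\eta_j(\mathbf{A})$, from the results of random matrix theory stated in the previous subsection, we expect that $\sigma_{\min}(\mathbf{B})$ and $\sigma_{\max}(\mathbf{B}^c)$ are close to $1 - \sqrt{j/n}$ and $1 + \sqrt{(m-j)/n}$, respectively. Hence, $\eta_j(\mathbf{A})$ and $\gamma_j(\mathbf{A})$ are expected to be close to the quantities

$$\eta[j] \triangleq \frac{1 + \sqrt{(m-j)/n}}{1 - \sqrt{j/n}} \tag{53}$$

and

$$\gamma[j] \triangleq \sqrt{(m-j)\left(1 + \eta^2[j]\right)}, \tag{54}$$

respectively. More precisely, the results of the previous subsection imply that where $m, n \to \infty$ while $j/m$ converges to a constant and $j/n$ converges to a constant strictly smaller than 1, then $\eta_j(\mathbf{A})$ and $\gamma_j(\mathbf{A})$ will converge a.s. to $\eta[j]$ and $\gamma[j]$, respectively.

To measure the deviation of $\eta_j(\mathbf{A})$ and $\gamma_j(\mathbf{A})$ from the above quantities (to larger values), let us define the shorthands

$$\eta_{r_1 r_2}[j] \triangleq \frac{1 + \sqrt{(m-j)/n} + r_1}{1 - \sqrt{j/n} - r_2}, \tag{55}$$

$$\gamma_{r_1 r_2}[j] \triangleq \sqrt{(m-j)\left(1 + \eta_{r_1 r_2}^2[j]\right)}. \tag{56}$$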

However, note that the bound of Theorem 3 depends mainly on the quantities $\gamma(\mathbf{A})$ and $\gamma'(\mathbf{A})$, not $\gamma_j(\mathbf{A})$. In other words, we need to maximize $\gamma[j]$ (or $\gamma_{r_1 r_2}[j]$) over $j = 1, 2, \ldots, \ell$. The following lemma, whose proof has been left to the appendix, shows that the sequences defined above, i.e. $\eta_{r_1 r_2}[j]$ and $\gamma_{r_1 r_2}[j]$, and hence $\eta[j]$ and $\gamma[j]$ (as the special case of $\eta_{r_1 r_2}[j]$ and $\gamma_{r_1 r_2}[j]$ for $r_1 = r_2 = 0$), are all increasing with respect to $j$. Therefore, it shows that the above maximum is obtained for $j = \ell$.

Lemma 4. The sequence $\{\gamma_{r_1 r_2}[j]\}$, $j = 1, \ldots, \ell$, where $\ell \le n - 1$, is strictly increasing for all $r_1 \ge 0$ and $0 \le r_2 < 1 - \sqrt{\ell/n}$. Moreover, $\forall j: \gamma_{r_1 r_2}[j] > \sqrt{m}$.

Remark 1. The above lemma shows that the sequences $\eta[j]$ and $\eta_{r_1 r_2}[j]$ are also increasing, because the product of $(1 + \eta_{r_1 r_2}^2[j])$ and the decreasing (and positive) sequence $\{m - j\}$ has become an increasing sequence.

Remark 2. Note that for large dictionaries (more precisely where $m, n \to \infty$ while $j/m$ converges to a constant and $j/n$ converges to a constant strictly smaller than 1), $\gamma_j(\mathbf{A})$ converges a.s. to $\gamma[j]$. The above lemma states hence that for large dictionaries the sequence $\gamma_j(\mathbf{A})$ is almost surely increasing and hence a.s. $\gamma(\mathbf{A}) = \gamma_\ell(\mathbf{A}) = \gamma[\ell]$. Moreover, the second part of the above lemma states that a.s. $\gamma'(\mathbf{A}) = \gamma(\mathbf{A}) = \gamma_\ell(\mathbf{A}) = \gamma[\ell]$.

Remark 3. In the Remark after Definition 3 in Section IV-A we had provided an example of a matrix $\mathbf{A}$ for which the sequences $\eta_j(\mathbf{A})$ and $\gamma_j(\mathbf{A})$ were not increasing. The matrix $\mathbf{A}$ of that example was of very small size, and the above remark states that where the size of the dictionary grows, finding such examples becomes more and more difficult.
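The quantities (53) to (56) are simple closed-form expressions, so the monotonicity stated in Lemma 4 can be inspected numerically. The sketch below assumes the forms $\eta_{r_1 r_2}[j] = (1 + \sqrt{(m-j)/n} + r_1)/(1 - \sqrt{j/n} - r_2)$ and $\gamma_{r_1 r_2}[j] = \sqrt{(m-j)(1 + \eta_{r_1 r_2}^2[j])}$; the dictionary sizes and the values of $r_1$, $r_2$ are hypothetical.

```python
import math

def eta(j, m, n, r1=0.0, r2=0.0):
    """eta_{r1 r2}[j] as in (55); r1 = r2 = 0 gives eta[j] of (53)."""
    denom = 1.0 - math.sqrt(j / n) - r2
    if denom <= 0:
        raise ValueError("need r2 < 1 - sqrt(j/n)")
    return (1.0 + math.sqrt((m - j) / n) + r1) / denom

def gamma(j, m, n, r1=0.0, r2=0.0):
    """gamma_{r1 r2}[j] as in (56); r1 = r2 = 0 gives gamma[j] of (54)."""
    return math.sqrt((m - j) * (1.0 + eta(j, m, n, r1, r2) ** 2))

# Hypothetical sizes: an n x m dictionary and sparsity levels j = 1, ..., ell.
m, n, ell = 200, 100, 30
r2_max = 1.0 - math.sqrt(ell / n)
seq = [gamma(j, m, n, r1=0.1, r2=0.5 * r2_max) for j in range(1, ell + 1)]

is_increasing = all(a < b for a, b in zip(seq, seq[1:]))
exceeds_sqrt_m = all(g > math.sqrt(m) for g in seq)
print(is_increasing, exceeds_sqrt_m)
```

Because the sequence is increasing, its maximum over $j = 1, \ldots, \ell$ is simply its last element, which is why only $\gamma_{r_1 r_2}[\ell]$ appears in the probabilistic bounds.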

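C. Probabilistic bounds on $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$

In this section, we state two theorems as probabilistic upper bounds on the error, where the dictionary $\mathbf{A}$ is random with iid Gaussian entries. The first theorem states a bound for dictionaries of any size, whereas the second theorem considers the case of large dictionaries.

Theorem 6. Let $\mathbf{A}$ be an $n \times m$, $m > n$, random matrix with iid and zero-mean Gaussian entries. Let $\ell$ be an integer in the range $1 \le \ell \le n - 1$. Suppose that $\mathbf{s}_0$ is a solution of $\mathbf{A}\mathbf{s} = \mathbf{x}$ for which $\|\mathbf{s}_0\|_0 \le \ell/2$. Let $\hat{\mathbf{s}}$ be a solution of $\mathbf{A}\mathbf{s} = \mathbf{x}$, and define $\alpha_{\hat{\mathbf{s}},\ell} \triangleq h(\lfloor \ell/2 \rfloor + 1, \hat{\mathbf{s}})$. Then for all $r_1 \ge 0$ and $0 \le r_2 < 1 - \sqrt{\ell/n}$,

$$P\left\{\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 > \gamma_{r_1 r_2}[\ell]\,\alpha_{\hat{\mathbf{s}},\ell}\right\} \le \left(\sum_{j=1}^{\ell}\binom{m}{j}\right)\left(e^{-n r_1^2/2} + e^{-n r_2^2/2}\right). \tag{57}$$

Remark 1. When $\mathbf{A}$ is random as in Theorem 6, it satisfies the URP with probability 1, and hence by the uniqueness theorem, any solution with $\|\mathbf{s}_0\|_0 \le \frac{n}{2}$ would be unique. Suppose, however, that $p \triangleq \|\mathbf{s}_0\|_0 < \frac{n}{2}$ (not $p \le \frac{n}{2}$) and set $\ell = 2p$. Then $\ell \le n - 1$ and hence from (57),

$$P\left\{\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 > \gamma_{r_1 r_2}[2p]\,\alpha_{\hat{\mathbf{s}},2p}\right\} \le \left(\sum_{j=1}^{2p}\binom{m}{j}\right)\left(e^{-n r_1^2/2} + e^{-n r_2^2/2}\right). \tag{58}$$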



Remark 2: Note that when $n$ grows, (58) does not necessarily provide a good upper bound on $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$, in the sense that as $n$ increases, the probability that $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 > \gamma_{r_1 r_2}[2p]\,\alpha_{\hat{\mathbf{s}},2p}$ does not necessarily decrease exponentially to zero. The point is that the maximum value for $r_2$ in Theorem 6 is $1 - \sqrt{\ell/n}$, hence, where $n$ increases, although the term $n$ in $e^{-n r_2^2/2}$ increases, $r_2$ has to be smaller, and hence $e^{-n r_2^2/2}$ does not necessarily decrease. In fact, the degree of sparsity of $\mathbf{s}_0$ plays an important role here. For example, let us choose $r_1 = r_2$, and suppose that $2p$ is equal to its maximum theoretical value (which is $n - 1$, because we had already excluded $p = n/2$, which turns $\gamma_{r_1 r_2}[2p]$ to infinity). We expect heuristically that for large $n$, $\eta_{2p}(\mathbf{A})$ is close to $\eta[2p] = \left(1 + \sqrt{(m-2p)/n}\right)/\left(1 - \sqrt{2p/n}\right)$. Taking into account the form of the denominator of $\eta_{r_1 r_2}[2p]$, for having $\gamma_{r_1 r_2}[2p]$ not too far from $\gamma[2p]$, let us choose the value of $r_2$ as a small fraction of $1 - \sqrt{2p/n}$, that is, let $r_2 = c \cdot (1 - \sqrt{2p/n})$, where $0 < c < 1$. Then from (58),

$$P\left\{\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 > \gamma_{r_1 r_2}[2p]\,\alpha_{\hat{\mathbf{s}},2p}\right\} \le \Lambda\, e^{-c^2 n \left(1 - \sqrt{2p/n}\right)^2/2}, \tag{59}$$

where $\Lambda \triangleq 2\sum_{j=1}^{2p}\binom{m}{j}$. Consider now the exponent $c^2 n (1 - \sqrt{2p/n})^2/2$. If $2p = n - 1$, this exponent is in fact a decreasing function of $n$, and converges to 0 where $n \to \infty$. Consequently, by increasing $n$, not only does $e^{-n r_2^2/2}$ not decrease, but it also increases to 1.
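The behavior of the exponent in (59) is easy to tabulate. The following sketch, with an arbitrarily chosen $c$ and a hypothetical comparison case in which $2p/n$ is held at $1/2$, contrasts the extreme choice $2p = n - 1$ with a fixed sparsity ratio.

```python
import math

def exponent(n, two_p, c):
    """c^2 * n * (1 - sqrt(2p/n))^2 / 2, the exponent appearing in (59)."""
    return c**2 * n * (1.0 - math.sqrt(two_p / n)) ** 2 / 2.0

c = 0.5
sizes = (10, 100, 1000, 10000)
worst = [exponent(n, n - 1, c) for n in sizes]     # 2p at its maximum value n - 1
fixed = [exponent(n, n // 2, c) for n in sizes]    # 2p / n held at 1/2 (illustrative choice)
for n, w, f in zip(sizes, worst, fixed):
    print(n, w, math.exp(-w), f, math.exp(-f))
```

For $2p = n - 1$ the exponent shrinks roughly like $c^2/(8n)$, so the exponential factor approaches 1, whereas for a fixed ratio $2p/n < 1$ the exponent grows linearly in $n$.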

Another way to see the above problem is to note that, as stated after (53), for large matrices $\eta_{2p}(\mathbf{A})$ converges to $\eta[2p]$ only if $2p/n$ converges to a value 'strictly' smaller than 1. This is also seen from the discussion at the end of the first paragraph of Section VI-A.

On the other hand, if $p$ can be at most equal to a fraction of $n/2$, say $p = u \cdot (n/2)$, where $u < 1$, then $e^{-n r_2^2/2} = \exp\{-c^2 n (1 - \sqrt{u})^2/2\}$, which decreases exponentially where $n \to \infty$. The right hand side of (58) does not yet necessarily decrease, however, due to the combinatorial part. We can, however, state the following theorem, for smaller $u$'s (as will be discussed after the theorem):

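Theorem 7. Let $\mathbf{A}$ be an $n \times m$, $m > n$, random matrix with iid zero-mean Gaussian entries. Suppose that $\mathbf{s}_0$ is a solution of $\mathbf{A}\mathbf{s} = \mathbf{x}$ with sparsity $p$, i.e., $p \triangleq \|\mathbf{s}_0\|_0$. Let $\hat{\mathbf{s}}$ be a solution of $\mathbf{A}\mathbf{s} = \mathbf{x}$, and define $\alpha_{\hat{\mathbf{s}},2p} \triangleq h(p + 1, \hat{\mathbf{s}})$. If $n \to \infty$, while $2p/n \to u < 1$ and $m/2p \to v$, then for every $r_1 \ge 0$ and $0 \le r_2 < 1 - \sqrt{u}$, with an exponentially increasing probability (with respect to $n$) we have

$$\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 \le \gamma_{r_1 r_2}[2p]\,\alpha_{\hat{\mathbf{s}},2p}, \tag{60}$$

provided that

$$u(1 + \ln v) < \min(r_1^2, r_2^2)/2. \tag{61}$$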



Remark. Condition (61) puts a limit on the maximum of the sparsity ($u$) for which the above theorem is applicable. To see this, let us fix the underdeterminedness factor $\beta \triangleq m/n > 1$. Then $v = \beta/u$, and hence (61) and $r_2 < 1 - \sqrt{u}$ imply that

$$u\left(1 + \ln\frac{\beta}{u}\right) < (1 - \sqrt{u})^2/2. \tag{62}$$

It is easy⁷ to see that for each $\beta > 1$, the function $u(1 + \ln(\beta/u))/(1 - \sqrt{u})^2$ is strictly increasing with respect to $u$ over $u \in (0, 1)$. Hence, if (62) holds for a $u = u_0$, it holds also for every $u < u_0$. Therefore, (62) puts a limit on the maximum of the sparsity for which (60) holds.

Moreover, if, as done in (59), we choose $r_1 = r_2 = c(1 - \sqrt{u})$, where $0 < c < 1$, then (61) states that

$$u\left(1 + \ln\frac{\beta}{u}\right) < c^2(1 - \sqrt{u})^2/2. \tag{63}$$

Similarly, this equation puts a limit on the maximum of sparsity, and since $u(1 + \ln(\beta/u))/(1 - \sqrt{u})^2$ is increasing, for smaller values of $c$, this maximum on sparsity is more restricted.

By replacing the inequality in (63) with equality and solving it with respect to $u$, for each $\beta$ we obtain the supremum of $u$ for which Theorem 7 is applicable. Figure 3 shows the plot of this supremum versus $\beta$ for different values of $c$. Note that the value $c = 1$ cannot be used, because it turns $\gamma_{r_1 r_2}[2p]$ and hence the right hand side of (60) to infinity. It has been plotted, however, because it indicates the supremum value of sparsity for which one can choose a value for $r_2$ such that Theorem 7 is applicable.⁸ It is seen that the range of sparsity for which we can use this theorem is much more restricted compared to the uniqueness condition $u < 1$.

Fig. 3. The supremum of the values of sparsity ($u = 2p/n$) versus $\beta = m/n$ for which Theorem 7 holds.

⁷ This is because direct calculation shows that the derivative of $u(1 + \ln(\beta/u))/(1 - \sqrt{u})^2$ with respect to $u$ is equal to $(\sqrt{u} + \ln(\beta/u))/(1 - \sqrt{u})^3$, which is strictly positive for $\beta > 1$ and $0 < u < 1$.
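Solving (63) with equality is a one-dimensional root-finding problem, and the monotonicity of $u(1 + \ln(\beta/u))/(1 - \sqrt{u})^2$ makes bisection a natural choice. The sketch below is one possible way to compute the supremum; the function names, the tolerance and the sample values of $\beta$ and $c$ are assumptions of this sketch, not taken from the paper.

```python
import math

def g(u, beta):
    """u (1 + ln(beta/u)) / (1 - sqrt(u))^2, strictly increasing on (0, 1) for beta > 1."""
    return u * (1.0 + math.log(beta / u)) / (1.0 - math.sqrt(u)) ** 2

def sup_sparsity(beta, c, tol=1e-12):
    """Solve g(u, beta) = c^2 / 2 for u in (0, 1) by bisection."""
    lo, hi = 0.0, 1.0 - 1e-12
    target = c * c / 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid, beta) < target:
            lo = mid      # root lies to the right of mid
        else:
            hi = mid      # root lies at or to the left of mid
    return 0.5 * (lo + hi)

# Illustrative values of the underdeterminedness factor beta and of c.
u_c1 = sup_sparsity(beta=2.0, c=1.0)
u_c05 = sup_sparsity(beta=2.0, c=0.5)
u_beta4 = sup_sparsity(beta=4.0, c=1.0)
print(u_c1, u_c05, u_beta4)
```

Smaller $c$ and larger $\beta$ both reduce the supremum, in line with the discussion of the figure.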



D. Proofs 

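We first need the following proposition, which states probabilistic upper bounds on the quantities $\eta_{r_1 r_2}[j]$ and $\gamma_{r_1 r_2}[j]$.

Proposition 6. Let $\mathbf{A}$ be an $n \times m$, $m > n$, random matrix with iid and zero-mean Gaussian entries. Then for each $j = 1, \ldots, n - 1$ and for all $r_1 \ge 0$ and $0 \le r_2 < 1 - \sqrt{j/n}$ we have

$$P\left\{\eta_j(\mathbf{A}) > \eta_{r_1 r_2}[j]\right\} \le \binom{m}{j}\left(e^{-n r_1^2/2} + e^{-n r_2^2/2}\right), \tag{64}$$

and hence

$$P\left\{\gamma_j(\mathbf{A}) > \gamma_{r_1 r_2}[j]\right\} \le \binom{m}{j}\left(e^{-n r_1^2/2} + e^{-n r_2^2/2}\right). \tag{65}$$

Remark. Any upper bound on $\binom{m}{j}$ can be used to replace this term in (65). For example BHl Sec. IV-A],

$$\binom{m}{j} \le e^{m H(j/m)}, \tag{66}$$

where $\forall x \in (0, 1)$, $H(x) \triangleq -x \ln x - (1 - x)\ln(1 - x)$.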

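Proof of Proposition 6: There is no assumption in the proposition about the variance of the entries of $\mathbf{A}$. However, since multiplying each entry of $\mathbf{A}$ by a constant does not change $\eta_j(\mathbf{A})$ and $\gamma_j(\mathbf{A})$, it can be assumed, without loss of generality, that this variance is equal to $\frac{1}{n}$. Let now $\mathbf{B}$ be a submatrix of $\mathbf{A}$ obtained by taking $j$ 'fixed' columns of $\mathbf{A}$, and define $\eta_{\mathbf{B}} \triangleq \sigma_{\max}(\mathbf{B}^c)/\sigma_{\min}(\mathbf{B})$. Then, from the Davidson and Szarek inequalities, we have

$$P\left\{\sigma_{\max}(\mathbf{B}^c) > 1 + \sqrt{\tfrac{m-j}{n}} + r_1\right\} \le e^{-n r_1^2/2}, \tag{67}$$

$$P\left\{\sigma_{\min}(\mathbf{B}) < 1 - \sqrt{\tfrac{j}{n}} - r_2\right\} \le e^{-n r_2^2/2}, \tag{68}$$

and hence

$$P\left\{\eta_{\mathbf{B}} > \eta_{r_1 r_2}[j]\right\} \le P\left\{\sigma_{\max}(\mathbf{B}^c) > 1 + \sqrt{\tfrac{m-j}{n}} + r_1\right\} + P\left\{\sigma_{\min}(\mathbf{B}) < 1 - \sqrt{\tfrac{j}{n}} - r_2\right\} \le e^{-n r_1^2/2} + e^{-n r_2^2/2}. \tag{69}$$

$\eta_j(\mathbf{A})$ is the maximum of $\eta_{\mathbf{B}}$ over all $\binom{m}{j}$ possible choices for $\mathbf{B}$. Therefore, using the union bound,

$$P\left\{\eta_j(\mathbf{A}) > \eta_{r_1 r_2}[j]\right\} = P\left\{\bigcup_{\mathbf{B} \in \mathcal{P}_j(\mathbf{A})} \left\{\eta_{\mathbf{B}} > \eta_{r_1 r_2}[j]\right\}\right\} \le \binom{m}{j}\left(e^{-n r_1^2/2} + e^{-n r_2^2/2}\right),$$

which completes the proof of (64). We have also (65), because the events $\{\eta_j(\mathbf{A}) > \eta_{r_1 r_2}[j]\}$ and $\{\gamma_j(\mathbf{A}) > \gamma_{r_1 r_2}[j]\}$ are identical (while $1 - \sqrt{j/n} - r_2 > 0$). ■

⁸ One may note some kind of tradeoff here. Smaller $c$ results in less deviation of $\gamma_{r_1 r_2}[2p]$ from $\gamma[2p]$, and hence a better upper bound in (60), but it decreases the sparsity for which Theorem 7 is applicable.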



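Proof of Theorem 6: From (23), if $\gamma'(\mathbf{A}) \le \gamma_{r_1 r_2}[\ell]$ then $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 \le \gamma_{r_1 r_2}[\ell]\,\alpha_{\hat{\mathbf{s}},\ell}$. This is equivalent to saying that if $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 > \gamma_{r_1 r_2}[\ell]\,\alpha_{\hat{\mathbf{s}},\ell}$ then $\gamma'(\mathbf{A}) > \gamma_{r_1 r_2}[\ell]$. Therefore

$$P\left\{\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 > \gamma_{r_1 r_2}[\ell]\,\alpha_{\hat{\mathbf{s}},\ell}\right\} \le P\left\{\gamma'(\mathbf{A}) > \gamma_{r_1 r_2}[\ell]\right\}. \tag{70}$$

By Lemma 4, $\sqrt{m} < \gamma_{r_1 r_2}[\ell]$ and hence $\gamma'(\mathbf{A}) > \gamma_{r_1 r_2}[\ell]$ is equivalent to $\max_{1 \le j \le \ell} \gamma_j(\mathbf{A}) > \gamma_{r_1 r_2}[\ell]$. Therefore, from the union bound,

$$P\left\{\gamma'(\mathbf{A}) > \gamma_{r_1 r_2}[\ell]\right\} \le \sum_{j=1}^{\ell} P\left\{\gamma_j(\mathbf{A}) > \gamma_{r_1 r_2}[\ell]\right\} \le \sum_{j=1}^{\ell} P\left\{\gamma_j(\mathbf{A}) > \gamma_{r_1 r_2}[j]\right\}, \tag{71}$$

where in the last inequality, Lemma 4 has been used. Now, combining (70), (71) and (65) proves the theorem. ■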



Proof of Theorem 7: Let $r \triangleq \min(r_1, r_2)$ and $P \triangleq P\{\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2 > \gamma_{r_1 r_2}[2p]\,\alpha_{\hat{\mathbf{s}},2p}\}$. Then, from (58) we have

$$P \le 2p\binom{m}{2p}\left(e^{-n r_1^2/2} + e^{-n r_2^2/2}\right) \le 4p\binom{m}{2p}e^{-n r^2/2}. \tag{72}$$

Hence, from (66)

$$P \le 4p \cdot \exp\left\{2p\ln\frac{m}{2p} + 2p - \frac{n r^2}{2}\right\} = 4p \cdot \exp\left\{-n\left[\frac{r^2}{2} - \frac{2p}{n}\left(1 + \ln\frac{m}{2p}\right)\right]\right\}. \tag{73}$$

When $n$ grows to infinity, the coefficient of $-n$ in the exponent converges to the constant $r^2/2 - u(1 + \ln v)$, which is positive by the assumption (61), and hence, $P$ is upper bounded by an exponentially decreasing function. ■

VII. Conclusion 

In this paper, we studied upper bounds for the estimation error $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$. We saw that such bounds can be constructed based only on $\hat{\mathbf{s}}$, and without knowing $\mathbf{s}_0$ (the existence of a sparse $\mathbf{s}_0$ satisfying $\|\mathbf{s}_0\|_0 < \frac{1}{2}\mathrm{spark}(\mathbf{A})$ has been assumed). We have also presented a tight upper bound for this error. Besides being tight, this bound does not impose any assumption on the normalization of the atoms of the dictionary, which enabled us to study random dictionaries (which are used e.g. in compressed sensing).

As a result, our bounds guarantee that whenever $\hat{\mathbf{s}}$ is only approximately (not exactly) sparse, it will not be too far from $\mathbf{s}_0$, and the upper bound on their distance is determined by the properties of the dictionary ($\mathbf{A}$). This upper bound also decreases when $\hat{\mathbf{s}}$ is sparse with a better approximation. From this point of view, our bounds can be seen as a generalization of the uniqueness theorem to the case where $\hat{\mathbf{s}}$ is only approximately sparse. Moreover, these bounds show that whenever $\|\mathbf{s}_0\|_0$ grows, to obtain a predetermined guarantee on the maximum of $\|\hat{\mathbf{s}} - \mathbf{s}_0\|_2$, $\hat{\mathbf{s}}$ needs to be sparse with a better approximation. This can be seen as an explanation of the fact that the estimation quality of sparse recovery algorithms degrades whenever $\|\mathbf{s}_0\|_0$ grows.

We also studied the noisy case, and we saw that constructing a general upper bound for this case is not easy. Hence, we did not study random dictionaries for this noisy case, which can be a subject for future investigations.

VIII. Appendix 

A. Proof of Lemma 2

We will need the following lemma: 

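Lemma 5. Let $\mathbf{B}$ be an $n \times p$ matrix, $p \ge n$, with unit $\ell^2$
norm columns. Then $\sigma_{\max}(\mathbf{B}) \ge \sqrt{p/n} \ge 1$.

Proof: The singular values of $\mathbf{B}$ are the square roots of the
eigenvalues of $\mathbf{C}_{p \times p} = \mathbf{B}^T \mathbf{B}$. Moreover, since the columns
of $\mathbf{B}$ have unit Euclidean norms, the main diagonal elements
of $\mathbf{C}$ are all equal to 1. Therefore, $\sum_{i=1}^{p} \lambda_i(\mathbf{C}) = \mathrm{tr}(\mathbf{C}) = p$,
where $\lambda_i(\mathbf{C})$ denote the eigenvalues of $\mathbf{C}$. On the other hand,
the rank of $\mathbf{C}$ is at most $n$, and hence there are at most $n$
nonzero $\lambda_i$'s. Therefore

$$p = \sum_{i=1}^{p} \lambda_i(\mathbf{C}) \le n\,\lambda_{\max}(\mathbf{C}) \;\Rightarrow\; \lambda_{\max}(\mathbf{C}) \ge \frac{p}{n},$$

which completes the proof. ■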
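The trace argument of Lemma 5 can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the matrix is a random Gaussian matrix with normalized columns, the sizes `n = 5`, `p = 12` and all function names are hypothetical, and $\sigma_{\max}$ is estimated by plain power iteration on $\mathbf{C} = \mathbf{B}^T\mathbf{B}$ (the Rayleigh quotient never exceeds $\lambda_{\max}(\mathbf{C})$, so the estimate approaches $\sigma_{\max}$ from below).

```python
import math
import random

def random_unit_columns(n, p, rng):
    """Return the p columns of an n x p matrix, each scaled to unit Euclidean norm."""
    cols = []
    for _ in range(p):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        cols.append([x / norm for x in v])
    return cols

def sigma_max_lower_estimate(cols, iters=500):
    """Power iteration on C = B^T B; returns the square root of a Rayleigh quotient of C."""
    p, n = len(cols), len(cols[0])
    w = [1.0 / math.sqrt(p)] * p  # unit-norm starting vector
    rayleigh = 0.0
    for _ in range(iters):
        y = [sum(cols[j][i] * w[j] for j in range(p)) for i in range(n)]  # y = B w
        z = [sum(cols[j][i] * y[i] for i in range(n)) for j in range(p)]  # z = B^T y = C w
        rayleigh = sum(wj * zj for wj, zj in zip(w, z))  # w^T C w <= lambda_max(C)
        norm_z = math.sqrt(sum(x * x for x in z))
        w = [x / norm_z for x in z]
    return math.sqrt(rayleigh)

rng = random.Random(0)  # fixed seed; sizes below are hypothetical
n, p = 5, 12
B_cols = random_unit_columns(n, p, rng)
# tr(B^T B) is the sum of the squared column norms, hence equal to p.
trace_C = sum(sum(x * x for x in col) for col in B_cols)
sigma_est = sigma_max_lower_estimate(B_cols)
print(trace_C, sigma_est, math.sqrt(p / n))
```

Because the estimate is a lower bound on $\sigma_{\max}(\mathbf{B})$, observing it above $\sqrt{p/n}$ is consistent with the lemma.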
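Proof of Lemma 2: From the definition of $\gamma_j(\mathbf{A})$, for $j = 1$,
$\mathbf{A}_I$ has only one column and hence $\sigma_{\min}(\mathbf{A}_I) = 1$ using
Lemma 3. Moreover, the complementary submatrix $\mathbf{A}_{I^c}$ (the remaining
columns) is an $n \times (m-1)$ matrix. We write

$$\gamma_1(\mathbf{A}) \ge \sqrt{m} \;\Leftrightarrow\; \sqrt{(m-1)\left(1 + \sigma_{\max}^2(\mathbf{A}_{I^c})\right)} \ge \sqrt{m} \;\Leftrightarrow\; 1 + \sigma_{\max}^2(\mathbf{A}_{I^c}) \ge \frac{m}{m-1} \;\Leftrightarrow\; \sigma_{\max}^2(\mathbf{A}_{I^c}) \ge \frac{1}{m-1},$$

which holds by Lemma 5 because $\sigma_{\max}^2(\mathbf{A}_{I^c}) \ge 1 \ge \frac{1}{m-1}$. ■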

B. Proof of Lemma 4

To prove that $\gamma_{r_1 r_2}[j]$ is strictly increasing with respect to
$j$, we state the following lemma, in which we first define a
function $\Gamma(x)$, $x \in \mathbb{R}$, such that the $\gamma_{r_1 r_2}[j]$'s are scaled samples
of this function (more precisely, $\gamma_{r_1 r_2}[j] = \sqrt{n}\,\Gamma(j/n)$) for
appropriate values of the parameters of the function. Then,
we show that $\Gamma(x)$ is itself strictly increasing, and hence so
are its samples.

Lemma 6. Let $p$, $a$, $b$ be real numbers with $a \ge 0$, $p > b^2 > 0$.
Then the function $\Gamma(\cdot)$, defined below, is strictly increasing on
the interval $[0, b^2)$:

$$\Gamma(x) = \sqrt{(p - x)\left[1 + \left(\frac{a + \sqrt{p - x}}{b - \sqrt{x}}\right)^2\right]}. \qquad (74)$$

Before going to the proof, note that $\gamma_{r_1 r_2}[j] = \sqrt{n}\,\Gamma(j/n)$
for $p = m/n$, $a = 1 + r_1$ and $b = 1 - r_2$.

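Proof of Lemma 6: We have to prove that $g(x) = \Gamma^2(x)$
is increasing on $[0, b^2)$, and hence we have to
prove that $g'(x) > 0, \forall x \in [0, b^2)$. By defining $h(x) =
\sqrt{p - x}\,(a + \sqrt{p - x})/(b - \sqrt{x})$, we have $g(x) = (p - x) +
h^2(x)$, and hence $g'(x) = -1 + 2h(x)h'(x)$. Consequently,
we have to prove $2h(x)h'(x) > 1, \forall x \in [0, b^2)$. Direct
calculations show that $2h(x)h'(x)$ is equal to

$$\frac{a + \sqrt{p - x}}{(b - \sqrt{x})^3}\left[-(a + 2\sqrt{p - x})(b - \sqrt{x}) + \frac{p - x}{\sqrt{x}}\,(a + \sqrt{p - x})\right],$$

and hence $2h(x)h'(x) > 1$ is equivalent to

$$\frac{p - x}{\sqrt{x}}\,(a + \sqrt{p - x})^2 > (b - \sqrt{x})^3 + (a + \sqrt{p - x})(a + 2\sqrt{p - x})(b - \sqrt{x}). \qquad (75)$$

To prove (75), we multiply both sides by $\sqrt{x}/[(a +
\sqrt{p - x})^2 (b - \sqrt{x})]$ and write it as

$$\frac{p - x}{b - \sqrt{x}} > \sqrt{x}\left[\left(\frac{b - \sqrt{x}}{a + \sqrt{p - x}}\right)^2 + 1 + \frac{\sqrt{p - x}}{a + \sqrt{p - x}}\right]. \qquad (76)$$

Note that from $p > b^2$ we have

$$\frac{p - x}{b - \sqrt{x}} > \frac{b^2 - x}{b - \sqrt{x}} = b + \sqrt{x}, \qquad (77)$$

and hence to prove (76) it is sufficient to prove that

$$b + \sqrt{x} > \sqrt{x}\left[\left(\frac{b - \sqrt{x}}{a + \sqrt{p - x}}\right)^2 + 1 + \frac{\sqrt{p - x}}{a + \sqrt{p - x}}\right], \qquad (78)$$

which, after subtracting $\sqrt{x}$ from both sides and multiplying by $(a + \sqrt{p - x})^2$, is equivalent
to

$$b\,(a + \sqrt{p - x})^2 - \sqrt{x}\left[(b - \sqrt{x})^2 + \sqrt{p - x}\,(a + \sqrt{p - x})\right] > 0. \qquad (79)$$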
Fig. 4. A typical graph of $\Gamma(x)$ defined in (74) and the values $\gamma_{r_1 r_2}[j]$
defined in (56). Note: $\Gamma_j$ in the figure stands for $\gamma_{r_1 r_2}[j] = \sqrt{n}\,\Gamma(j/n)$.



Doing some algebraic manipulations, the left hand side of the
above inequality is equal to

$$a^2 b + ab\sqrt{p - x} + a\sqrt{p - x}\,(b - \sqrt{x}) + (b - \sqrt{x})(p - b\sqrt{x}), \qquad (80)$$

and hence (79) holds because the first 3 terms of the above
expression are nonnegative (note that $a$ may be equal to zero),
and the last term is positive from $b > \sqrt{x}$ and $p > b\sqrt{x}$
(because $p > b^2 = b \cdot b > b\sqrt{x}$). ■
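The algebra in this proof is easy to get wrong, so a numerical sanity check may help. The following sketch evaluates $\Gamma(x) = \sqrt{(p-x)\,[1 + ((a+\sqrt{p-x})/(b-\sqrt{x}))^2]}$ from (74) on a grid of $[0, b^2)$, and compares the left hand side of (79) with the expanded form (80). The parameter values and function names are hypothetical; they are only required to satisfy $a \ge 0$ and $p > b^2 > 0$.

```python
import math

def Gamma(x, p, a, b):
    """The function of (74): sqrt((p - x) * (1 + ((a + sqrt(p - x)) / (b - sqrt(x)))**2))."""
    s = math.sqrt(p - x)
    return math.sqrt((p - x) * (1.0 + ((a + s) / (b - math.sqrt(x))) ** 2))

def lhs_79(x, p, a, b):
    """Left hand side of inequality (79)."""
    s, t = math.sqrt(p - x), math.sqrt(x)
    return b * (a + s) ** 2 - t * ((b - t) ** 2 + s * (a + s))

def expr_80(x, p, a, b):
    """Expanded form (80) of the same quantity."""
    s, t = math.sqrt(p - x), math.sqrt(x)
    return a * a * b + a * b * s + a * s * (b - t) + (b - t) * (p - b * t)

# Hypothetical parameters with a >= 0 and p > b^2 > 0.
p, a, b = 2.0, 1.1, 0.9
grid = [k * b * b / 1000 for k in range(1000)]  # 1000 points in [0, b^2)

values = [Gamma(x, p, a, b) for x in grid]
increasing = all(u < v for u, v in zip(values, values[1:]))
identity_ok = all(
    math.isclose(lhs_79(x, p, a, b), expr_80(x, p, a, b), rel_tol=1e-9, abs_tol=1e-9)
    for x in grid
)
positive = all(expr_80(x, p, a, b) > 0 for x in grid)
print(increasing, identity_ok, positive)
```

A grid check of this kind does not replace the proof; it only guards against sign or exponent slips in (75) to (80).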

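Proof of Lemma 4: Note that $\gamma_{r_1 r_2}[j] = \sqrt{n}\,\Gamma(j/n)$ for
$p = m/n$, $a = 1 + r_1$ and $b = 1 - r_2$, where $\Gamma(\cdot)$ is as defined
in (74). Now, since $p = \frac{m}{n} > 1 > (1 - r_2)^2 = b^2 > 0$, the
conditions of Lemma 6 are satisfied, and hence that lemma
ensures that $\gamma_{r_1 r_2}[j]$ is strictly increasing.

Since $\gamma_{r_1 r_2}[j]$ is increasing, to prove $\gamma_{r_1 r_2}[j] > \sqrt{m}$ it suffices to
consider $j = 1$, for which the inequality is equivalent to

$$(m - 1)\left(1 + \eta_{r_1 r_2}^2[1]\right) > m \;\Leftrightarrow\; \eta_{r_1 r_2}^2[1] > \frac{1}{m - 1},$$

which holds because $\eta_{r_1 r_2}^2[1] > 1$ and $1 \ge \frac{1}{m-1}$. ■
Figure 4 shows a typical graph of $\Gamma(x)$ and $\gamma_{r_1 r_2}[j]$ (denoted
by $\Gamma_j$ in the figure).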
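The two claims of Lemma 4 can also be checked on concrete numbers. The sketch below computes $\gamma_{r_1 r_2}[j] = \sqrt{n}\,\Gamma(j/n)$ with $p = m/n$, $a = 1 + r_1$, $b = 1 - r_2$, for every $j$ with $\sqrt{j/n} < 1 - r_2$. The values of `n`, `m`, `r1`, `r2` are hypothetical, and the closed form used for $\Gamma$ is the one stated in (74).

```python
import math

def gamma_r1r2(j, n, m, r1, r2):
    """gamma_{r1 r2}[j] = sqrt(n) * Gamma(j / n), with p = m/n, a = 1 + r1, b = 1 - r2."""
    p, a, b, x = m / n, 1.0 + r1, 1.0 - r2, j / n
    s = math.sqrt(p - x)
    return math.sqrt(n * (p - x) * (1.0 + ((a + s) / (b - math.sqrt(x))) ** 2))

# Hypothetical sizes: an n x m dictionary with m > n.
n, m, r1, r2 = 100, 200, 0.1, 0.1

# Keep only indices j for which 1 - sqrt(j/n) - r2 > 0 (small margin against rounding).
js = [j for j in range(1, n) if math.sqrt(j / n) < (1.0 - r2) - 1e-9]
gammas = [gamma_r1r2(j, n, m, r1, r2) for j in js]

strictly_increasing = all(u < v for u, v in zip(gammas, gammas[1:]))
above_sqrt_m = gammas[0] > math.sqrt(m)
print(len(js), strictly_increasing, above_sqrt_m)
```

Checking only the first sample against $\sqrt{m}$ is enough here, because the monotonicity check covers the remaining indices.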